CN117333626A - Image sampling data acquisition method, device, computer equipment and storage medium


Info

Publication number
CN117333626A
CN117333626A (application CN202311598977.3A)
Authority
CN (China)
Prior art keywords
pixel, dimensional, network model, initial, grid
Legal status
Granted
Application number
CN202311598977.3A
Other languages
Chinese (zh)
Other versions
CN117333626B
Inventor
胡兰
张如高
虞正华
Current Assignee
Shenzhen Magic Vision Intelligent Technology Co ltd
Original Assignee
Shenzhen Magic Vision Intelligent Technology Co ltd
Application filed by Shenzhen Magic Vision Intelligent Technology Co ltd
Priority to CN202311598977.3A
Publication of CN117333626A
Application granted
Publication of CN117333626B
Legal status: Active


Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T17/00 — Three-dimensional [3D] modelling, e.g. data description of 3D objects
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06T7/50 — Image analysis; depth or shape recovery
    • G06T7/70 — Image analysis; determining position or orientation of objects or cameras

Abstract

The invention relates to the technical field of three-dimensional reconstruction, and discloses an image sampling data acquisition method and apparatus, a computer device and a storage medium. A two-dimensional image, together with the configuration information and pose information of the camera, is input into a monocular depth prediction network model to obtain a predicted depth value for each pixel. The predicted depth values, the two-dimensional image, the configuration information and the pose information are then input into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid. A first direction vector is determined for each pixel from its pixel value, the configuration information and the pose information. A sampling interval is determined from the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid, and the grid is sampled within this interval to obtain a sampling set. The invention samples data accurately and accelerates the subsequent model training process.

Description

Image sampling data acquisition method, device, computer equipment and storage medium
Technical Field
The present invention relates to the field of three-dimensional reconstruction technologies, and in particular, to a method and apparatus for acquiring image sampling data, a computer device, and a storage medium.
Background
In the field of autonomous driving, the images captured by cameras are two-dimensional, while three-dimensional data is needed to train a network model. An initial three-dimensional occupancy grid can therefore be determined from the two-dimensional images and an associated model, and a sampling set is then obtained by sampling this grid. In existing practice, sampling typically requires manual annotation first, after which samples are drawn within the manually annotated selection box.
However, sampling within a manually annotated selection box covers an unnecessarily large range. As a result, the subsequent model training process is slow and the accuracy of the trained model is low.
Disclosure of Invention
In view of the above, the present invention provides an image sampling data acquisition method and apparatus, a computer device and a storage medium, to solve the problems of a slow training process and low accuracy of the trained model caused by an excessively large sampling range.
In a first aspect, the present invention provides an image sampling data acquisition method, including:
acquiring at least one frame of two-dimensional image corresponding to a target object, together with configuration information and pose information of the camera when the two-dimensional image was captured;
inputting the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image;
inputting the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid, wherein the initial three-dimensional occupancy grid consists of a plurality of voxels;
determining a first direction vector corresponding to each pixel according to the pixel value of each pixel and the configuration information and pose information corresponding to the two-dimensional image to which the pixel belongs;
determining a sampling interval according to the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid;
and sampling the initial three-dimensional occupancy grid within the sampling interval to obtain a sampling set.
Specifically, no sampling box needs to be manually annotated; instead, sampling is performed within a sampling interval that corresponds closely to the scene surface. This reduces the volume of sampled data and avoids the large errors of manual annotation. Moreover, when the data sampled by this method is used in subsequent network model training, the smaller data volume and higher accuracy of the samples shorten the training time and reduce the number of training iterations. Fewer iterations also mean less intermediate and output data generated during training, saving storage space on the computer device.
In an optional implementation, inputting the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid includes:
determining, according to the pose information, a three-dimensional space corresponding to the two-dimensional image;
discretizing the three-dimensional space to obtain the voxels corresponding to the three-dimensional space;
performing projection transformation on each voxel to obtain a projected depth value of the pixel corresponding to each voxel in the two-dimensional space;
determining a surface formed by the voxels whose corresponding pixels have equal projected and predicted depth values as an initial surface of the target object;
calculating the distance from each voxel to the initial surface;
and determining the initial three-dimensional occupancy grid according to the distance from each voxel to the initial surface and a preset distance threshold, wherein the plurality of voxels forming the initial three-dimensional occupancy grid are a subset of the voxels corresponding to the three-dimensional space.
Specifically, the depth value of each pixel is determined in two different ways. If the two values differ, the error in the depth calculation for that pixel is large, and the corresponding voxel is not suitable for the initial surface; if they agree, the error is small and the voxel can be used. Taking the voxels whose two depth values agree as the initial surface of the target object therefore determines the initial surface accurately. Further, from the calculated distance of each voxel to the initial surface, the voxels close to the initial surface can be identified; using these voxels as the initial three-dimensional occupancy grid accurately represents the three-dimensional features of the target object.
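The construction just described — project each voxel, compare its projected depth with the predicted depth, and keep voxels near the resulting surface — can be sketched in numpy as follows. All names (`voxel_centers`, `T_wc`, etc.) and the numeric agreement tolerance are assumptions made for this illustration, not the patent's actual implementation:

```python
import numpy as np

def initial_occupancy_grid(voxel_centers, K, T_wc, pred_depth, dist_thresh):
    """Sketch of the occupancy-grid construction described above.

    voxel_centers: (N, 3) voxel centers in world coordinates
    K:             (3, 3) camera intrinsics (from the configuration information)
    T_wc:          (4, 4) world-to-camera transform (from the pose information)
    pred_depth:    (H, W) predicted depth map from the monocular network
    dist_thresh:   keep voxels whose distance to the surface is below this
    """
    H, W = pred_depth.shape
    # Projection transformation: project every voxel into the image to get
    # its projected depth value and its pixel coordinates.
    homo = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    cam = (T_wc @ homo.T).T[:, :3]                # camera-frame coordinates
    proj_depth = cam[:, 2]                        # projected depth per voxel
    uv = (K @ cam.T).T
    uv = (uv[:, :2] / uv[:, 2:3]).round().astype(int)
    in_img = (proj_depth > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
             & (uv[:, 1] >= 0) & (uv[:, 1] < H)

    # Surface voxels: projected depth agrees with the predicted depth
    # (an exact-equality test is replaced by a small tolerance here).
    agree = np.zeros(len(voxel_centers), dtype=bool)
    idx = np.where(in_img)[0]
    d_pred = pred_depth[uv[idx, 1], uv[idx, 0]]
    agree[idx] = np.abs(proj_depth[idx] - d_pred) < 1e-3

    # Keep voxels within dist_thresh of the nearest surface voxel.
    surface = voxel_centers[agree]
    dists = np.linalg.norm(
        voxel_centers[:, None, :] - surface[None, :, :], axis=-1).min(axis=1)
    return voxel_centers[dists <= dist_thresh]
```

The brute-force nearest-surface distance is O(N·M); a real implementation would use a spatial index or a truncated signed distance field, as the TSDF model mentioned later suggests.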
In an optional implementation, determining the first direction vector corresponding to each pixel according to the pixel value of each pixel and the configuration information and pose information corresponding to the two-dimensional image to which the pixel belongs specifically includes:
performing back-projection transformation on each pixel according to its pixel value and the configuration information to obtain a second direction vector of each pixel, wherein the second direction vector is a vector in the camera coordinate system;
and performing coordinate transformation on the second direction vector of each pixel according to its pixel value and the pose information to obtain the first direction vector, wherein the first direction vector is a vector in the world coordinate system.
Specifically, through back-projection transformation and coordinate transformation, vectors in the world coordinate system are obtained; that is, the second direction vectors are unified into the three-dimensional space of the initial three-dimensional occupancy grid, which facilitates the subsequent determination of the sampling interval.
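Reading the pixel's "pixel value" here as its image coordinates (u, v) — an assumption — the two transformations reduce, for a pinhole camera, to one matrix inverse and one rotation. `K` (intrinsics from the configuration information) and `T_cw` (camera-to-world pose from the pose information) are illustrative names:

```python
import numpy as np

def first_direction_vector(u, v, K, T_cw):
    """Back-project pixel (u, v) to a world-space ray direction (a sketch).

    K:    (3, 3) camera intrinsics
    T_cw: (4, 4) camera-to-world transform
    """
    # Back-projection transformation: second direction vector,
    # expressed in the camera coordinate system.
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Coordinate transformation: rotate into the world coordinate system
    # (directions are unaffected by the translation part of the pose).
    d_world = T_cw[:3, :3] @ d_cam
    return d_world / np.linalg.norm(d_world)   # first direction vector
```

With identity intrinsics and pose, the pixel at the principal point back-projects to the camera's optical axis, i.e. the unit vector along +z.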
In an optional implementation, determining the sampling interval according to the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid specifically includes:
when the first direction vector passes through the initial three-dimensional occupancy grid and produces two intersection points, determining the two intersection points as a start-point voxel and an end-point voxel respectively;
and determining the sampling interval according to the predicted depth value of the pixel corresponding to the start-point voxel and the predicted depth value of the pixel corresponding to the end-point voxel.
Specifically, a depth range strongly correlated with the two-dimensional image can be determined as the sampling interval from the predicted depth values corresponding to the nearest and farthest points where the first direction vector intersects the initial three-dimensional occupancy grid. Sampling within this interval yields more accurate sampling data, and training subsequent network models with less sampling data accelerates the whole training process and saves processing and storage resources.
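If the initial three-dimensional occupancy grid is approximated by its axis-aligned bounding box — an assumption made for this sketch, not the patent's exact procedure — the two intersection points along a pixel's first direction vector can be found with the standard slab method:

```python
import numpy as np

def sampling_interval(origin, direction, box_min, box_max):
    """Entry/exit depths where a pixel ray crosses the grid's bounding box.

    origin, direction: (3,) ray origin and direction (direction components
                       are assumed nonzero here to keep the sketch short)
    box_min, box_max:  (3,) corners of the occupancy grid's bounding box
    Returns (t_near, t_far), or None when the ray misses the grid.
    """
    inv = 1.0 / direction
    t0 = (box_min - origin) * inv
    t1 = (box_max - origin) * inv
    t_near = np.minimum(t0, t1).max()    # latest entry over the three slabs
    t_far = np.maximum(t0, t1).min()     # earliest exit
    if t_near > t_far or t_far < 0:
        return None                      # no two intersection points
    return max(t_near, 0.0), t_far
```

The returned pair corresponds to the start-point and end-point voxels of the claim; a per-voxel traversal (e.g. a 3D DDA) would refine this box-level interval.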
In an alternative embodiment, the method further comprises:
in a first training period, inputting the sampling set into a neural radiance field network model to be trained;
performing iterative training on the neural radiance field network model to be trained;
and when it is determined, from the estimated color value output by the neural radiance field network model for the pixel corresponding to each voxel and the pre-acquired original color value of that pixel, that the model meets the training-stop condition, determining the trained model as the first neural radiance field network model corresponding to the first training period.
Specifically, training the NeRF network model with a sampling set containing less data accelerates the training process of this period.
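A minimal sketch of such a first training period, assuming a model object exposing `forward` and `step` methods (both names are hypothetical stand-ins for whatever NeRF implementation is used, and the MSE stop condition is likewise an assumption):

```python
import numpy as np

def train_first_period(model, sample_set, original_colors,
                       tol=1e-3, max_iters=1000):
    """Iterate until the photometric error between estimated and original
    color values is small enough (the training-stop condition)."""
    for _ in range(max_iters):
        estimated = model.forward(sample_set)            # estimated colors
        loss = np.mean((estimated - original_colors) ** 2)
        if loss < tol:                                   # stop condition met
            break
        model.step(loss)                                 # parameter update
    return model
```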
In an alternative embodiment, the method further comprises:
in a second training period, inputting the voxels of an i-th three-dimensional occupancy grid into the i-th neural radiance field network model, performing the i-th round of iterative training, and obtaining the estimated color value of the pixel corresponding to each voxel and a weight value for each voxel output by the i-th model, wherein i is a positive integer, and when i is 1 the i-th three-dimensional occupancy grid is the initial three-dimensional occupancy grid;
optimizing the parameters of the i-th neural radiance field network model according to the estimated color value of the pixel corresponding to each voxel and the original color value corresponding to each voxel output by the i-th model, to obtain the (i+1)-th neural radiance field network model;
and screening out the voxels whose weight value is greater than a preset threshold from the voxels output by the i-th neural radiance field network model to construct the (i+1)-th three-dimensional occupancy grid, stopping training when i+1 reaches a preset count threshold, and determining the (i+1)-th three-dimensional occupancy grid as the target three-dimensional occupancy grid.
Specifically, by continuously filtering the voxels of the initial three-dimensional occupancy grid, the target three-dimensional occupancy grid composed of the finally retained voxels becomes closer to the actual three-dimensional model. Training the neural radiance field network model with the iteratively updated occupancy grid verifies the model's accuracy, and the verification result is used to update its network parameters. Because the training data is continuously refreshed, an accurate neural radiance field network model, and hence a more accurate target three-dimensional occupancy grid, can be obtained.
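The screening loop of the second training period might be sketched as follows; `weight_fn` stands in for the i-th model's per-voxel weight output, and every name here is purely illustrative:

```python
import numpy as np

def refine_occupancy_grid(initial_voxels, weight_fn, threshold, n_rounds):
    """Iteratively keep only voxels whose weight exceeds the preset threshold.

    initial_voxels: (N, 3) voxels of the initial occupancy grid
    weight_fn:      weight_fn(i, voxels) -> (N,) weights from the i-th model
    n_rounds:       the preset count threshold for training rounds
    """
    voxels = initial_voxels
    for i in range(n_rounds):
        weights = weight_fn(i, voxels)          # weight value per voxel
        voxels = voxels[weights > threshold]    # build the (i+1)-th grid
    return voxels                               # the target occupancy grid
```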
In an alternative embodiment, the method further comprises:
acquiring new pose information corresponding to the camera;
inputting the new pose information and the target three-dimensional occupancy grid into the (i+1)-th neural radiance field network model to obtain a two-dimensional image corresponding to the new pose information, wherein i+1 equals the preset count threshold.
Specifically, a two-dimensional image of the target object from a new viewpoint can be obtained without photographing it again with the camera, which is convenient.
In a second aspect, the present invention provides an image sampling data acquisition apparatus, the apparatus comprising:
an acquisition module, configured to acquire at least one frame of two-dimensional image corresponding to the target object, together with configuration information and pose information of the camera when the two-dimensional image was captured;
a model processing module, configured to input the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image, and to input the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid, wherein the initial three-dimensional occupancy grid consists of a plurality of voxels;
a determining module, configured to determine a first direction vector corresponding to each pixel according to the pixel value of each pixel and the configuration information and pose information corresponding to the two-dimensional image to which the pixel belongs, and to determine a sampling interval according to the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid;
and a sampling module, configured to sample the initial three-dimensional occupancy grid within the sampling interval to obtain a sampling set.
Specifically, no sampling box needs to be manually annotated; instead, sampling is performed within a sampling interval that corresponds closely to the scene surface. This reduces the volume of sampled data and avoids the large errors of manual annotation. Moreover, when the data sampled by this apparatus is used in subsequent network model training, the smaller data volume and higher accuracy of the samples shorten the training time and reduce the number of training iterations. Fewer iterations also mean less intermediate and output data generated during training, saving storage space on the computer device.
In a third aspect, the present invention provides a computer device, comprising a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the image sampling data acquisition method of the first aspect or any of its corresponding implementations.
In a fourth aspect, the present invention provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the image sampling data acquisition method of the first aspect or any of its corresponding implementations.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of a method for acquiring image sample data according to an embodiment of the present invention;
FIG. 2 is a flow chart of another image sample data acquisition method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method of training a NeRF network model according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method of applying a NeRF network model according to an embodiment of the present invention;
fig. 5 is a block diagram of a structure of an image sample data acquiring apparatus according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In driver-assistance scenarios, the deep-learning-based perception module is one of the main functional modules. It collects information from the environment and outputs the collected information to functional modules such as the positioning module and the decision module for corresponding processing. Each functional module is implemented on the basis of a network model, and an accurate network model is obtained by training with a large amount of data.
In the field of autonomous driving, the images captured by cameras are two-dimensional, while some three-dimensional information is required to train the network model, so three-dimensional information must be annotated for the two-dimensional images before the related network model is trained. For example, the three-dimensional information may be three-dimensional bounding-box information or semantic segmentation information of vehicles. Annotating three-dimensional information directly on two-dimensional data introduces large errors. Therefore, in the related art, three-dimensional data is obtained from the two-dimensional data with a three-dimensional model reconstruction algorithm, the three-dimensional information is annotated on the three-dimensional data, the annotated three-dimensional data is projected onto a two-dimensional image, and the related network model is trained with the projected two-dimensional image.
When reconstructing a three-dimensional model, a three-dimensional point cloud, for example a lidar point cloud, may be used; however, such point clouds are sparse, which can make the reconstructed model inaccurate. Alternatively, a Neural Radiance Field (NeRF) network model may be used for three-dimensional reconstruction. In the related art, this typically involves creating a three-dimensional occupancy grid and annotating it manually, for example by box selection; the points within the selected box are then used as sampling points for training the neural radiance field network model. Manual annotation yields a less accurate trained model, and the large number of manually selected points makes the subsequent model training process inefficient.
According to an embodiment of the present invention, an image sampling data acquisition method embodiment is provided. It should be noted that the steps shown in the flowcharts of the drawings may be performed in a computer system, such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
In this embodiment, an image sampling data acquisition method is provided, which may be used in a computer device. Fig. 1 is a flowchart of the image sampling data acquisition method according to an embodiment of the present invention. As shown in Fig. 1, the flow includes the following steps:
step S101, acquiring at least one frame of two-dimensional image corresponding to the target object, and configuration information and pose information of the camera when capturing the two-dimensional image.
The target object may be a specific scene, for example, a road scene, a parking lot scene, etc., or may be a specific object, for example, a desk, a computer, etc.
Specifically, when the target object is photographed, multiple frames of two-dimensional images can be photographed under the same configuration information and pose information, or multiple frames of two-dimensional images can be photographed under different configuration information or different pose information. In this way, the computer device may acquire at least one frame of two-dimensional image corresponding to the target object. The computer device may record configuration information and pose information of the camera corresponding to the two-dimensional image when each frame of the two-dimensional image is photographed.
Step S102, inputting the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image.
The monocular depth prediction network model may be a machine learning model, for example, a neural network model.
Specifically, when there are multiple frames of two-dimensional images, each frame, together with the configuration information and pose information of the camera when that frame was captured, can be input into the monocular depth prediction network model to obtain a predicted depth value for each pixel in each frame.
Step S103, inputting the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid.
The three-dimensional reconstruction network model may be a machine learning model, for example, a truncated signed distance function (Truncated Signed Distance Function, TSDF) model or a signed distance function (Signed Distance Function, SDF) model. The initial three-dimensional occupancy grid is made up of a plurality of voxels.
Specifically, when there are multiple frames of two-dimensional images, for each frame, the predicted depth value of each pixel in that image, together with the configuration information and pose information of the camera when it was captured, can be input into the three-dimensional reconstruction network model to obtain the initial three-dimensional occupancy grid.
Step S104, determining a first direction vector corresponding to each pixel according to the pixel value of each pixel, the configuration information and the pose information corresponding to the two-dimensional image to which each pixel belongs.
Step S105, determining a sampling interval according to the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid.
Specifically, when the two-dimensional image is a single frame, the unit sampling interval corresponding to each pixel may be determined from the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid; the unit sampling intervals of all pixels together constitute the sampling interval. When there are multiple frames, for pixels at the same position in different two-dimensional images, the several unit sampling intervals corresponding to that pixel position can be merged into a new unit sampling interval as follows: weight the left endpoints of the unit sampling intervals to obtain a new left endpoint, weight the right endpoints to obtain a new right endpoint, and take the interval between the new endpoints as the new unit sampling interval for that pixel position. The new unit sampling intervals of all pixel positions constitute the sampling interval.
Thus, in either situation the sampling interval can be determined by the corresponding processing, facilitating more accurate subsequent data sampling.
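The endpoint-weighting rule for multi-frame pixels described above can be sketched as follows (uniform weights are assumed here when none are supplied; the names are illustrative):

```python
import numpy as np

def merge_unit_intervals(intervals, weights=None):
    """Merge the per-frame unit sampling intervals of pixels at the same
    position by weighted-averaging their left and right endpoints.

    intervals: sequence of (left, right) pairs, one per frame
    weights:   optional per-frame weights summing to 1
    """
    intervals = np.asarray(intervals, dtype=float)   # (n_frames, 2)
    if weights is None:
        weights = np.full(len(intervals), 1.0 / len(intervals))
    left = np.dot(weights, intervals[:, 0])          # new left endpoint
    right = np.dot(weights, intervals[:, 1])         # new right endpoint
    return left, right
```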
Step S106, sampling the initial three-dimensional occupancy grid within the sampling interval to obtain a sampling set.
Specifically, sampling may be performed using a method such as multi-stage sampling, average sampling or weighted sampling.
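As an illustration, average sampling within the interval, and a jittered stratified variant of it, might look like this in numpy; the patent does not prescribe these exact formulas, so both are sketches:

```python
import numpy as np

def average_sample(t_near, t_far, n):
    """Uniformly spaced sample depths in the sampling interval
    (the 'average sampling' option mentioned above)."""
    return np.linspace(t_near, t_far, n)

def stratified_sample(t_near, t_far, n, rng=None):
    """One jittered sample per equal sub-interval of [t_near, t_far]."""
    if rng is None:
        rng = np.random.default_rng(0)
    edges = np.linspace(t_near, t_far, n + 1)
    return edges[:-1] + rng.random(n) * np.diff(edges)
```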
According to the image sampling data acquisition method provided by this embodiment, the computer device determines the predicted depth value of each pixel from the captured two-dimensional image and the camera's configuration information and pose information at capture time. The predicted depth values represent three-dimensional information corresponding to the two-dimensional image. From this predicted three-dimensional information, the two-dimensional image and the camera's configuration and pose information, an initial three-dimensional occupancy grid can be constructed that represents the three-dimensional surface information of the captured target object. In addition, the computer device determines, from each pixel's value and the configuration and pose information of the two-dimensional image it belongs to, the ray along which the camera viewed the target object, that is, the first direction vector of each pixel, and then determines the sampling interval from the intersection points of the initial three-dimensional occupancy grid with these first direction vectors. The resulting sampling interval lies near the scene surface of the target object. With this scheme, no sampling box needs to be manually annotated; sampling is performed accurately within the interval corresponding to the scene surface, which reduces the volume of sampled data and avoids the large errors of manual annotation.
Moreover, when the data sampled by this method is used in subsequent network model training, the smaller data volume and higher accuracy of the samples shorten the training time and reduce the number of training iterations. Fewer iterations also mean less intermediate and output data generated during training, saving storage space on the computer device.
In this embodiment, an image sampling data acquisition method is provided, which may be used in a computer device. Fig. 2 is a flowchart of the image sampling data acquisition method according to an embodiment of the present invention. As shown in Fig. 2, the flow includes the following steps:
step S201, acquiring at least one frame of two-dimensional image corresponding to the target object, and configuration information and pose information of the camera when capturing the two-dimensional image.
Step S202, inputting the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image.
The specific processing of step S201 to step S202 is similar to that of step S101 to step S102, and will not be described here.
And step S203, inputting the predicted depth value, the two-dimensional image, the configuration information and the pose information of each pixel into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupation grid.
Wherein the initial three-dimensional occupancy grid is composed of a plurality of voxels.
Specifically, the step S203 includes:
Step S2031, determining a three-dimensional space corresponding to the two-dimensional image according to the pose information.
Step S2032, discretizing the three-dimensional space to obtain voxels corresponding to the three-dimensional space.
In step S2033, each voxel is subjected to projective transformation, so as to obtain a projective depth value of the corresponding pixel of each voxel in the two-dimensional space.
In step S2034, a plane formed by the voxels corresponding to pixels whose projection depth value equals the predicted depth value is determined as the initial surface of the target object.
In step S2035, the distance of each voxel to the initial surface is calculated.
Step S2036, determining an initial three-dimensional occupancy grid according to the distance from each voxel to the initial surface and the preset distance threshold.
The plurality of voxels forming the initial three-dimensional space occupying grid are partial voxels in the voxels corresponding to the three-dimensional space. The preset distance threshold may be a side length or a diagonal length of one voxel.
In particular, the computer device may determine the voxels whose distance to the initial surface is less than the preset distance threshold as the voxels constituting the initial three-dimensional occupancy grid.
The depth value of each pixel is determined in two different ways. If the two determined depth values differ, the error in the depth calculation for that pixel is large, and the corresponding voxel is not suitable for the initial surface; conversely, if they are equal, the error is small and the voxel can be used for the initial surface. Therefore, taking the voxels corresponding to pixels whose two depth values are equal as the initial surface of the target object allows the initial surface to be determined accurately. Further, from the calculated distance of each voxel to the initial surface, the voxels close to the initial surface can be identified. Using these voxels as the initial three-dimensional occupancy grid accurately represents the three-dimensional characteristics of the target object.
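The surface test and distance filtering of steps S2031 to S2036 can be sketched as follows. The function name, the depth tolerance, and the brute-force nearest-surface search are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def build_initial_occupancy_grid(voxel_centers, proj_depth, pred_depth,
                                 voxel_size, depth_tol=1e-3):
    # Surface voxels: the projection depth value equals (is within a small
    # assumed tolerance of) the monocular predicted depth value (step S2034).
    on_surface = np.abs(proj_depth - pred_depth) < depth_tol
    surface_pts = voxel_centers[on_surface]
    if surface_pts.shape[0] == 0:
        return np.zeros(len(voxel_centers), dtype=bool)
    # Distance from every voxel center to the nearest surface voxel (S2035).
    dists = np.linalg.norm(
        voxel_centers[:, None, :] - surface_pts[None, :, :], axis=-1
    ).min(axis=1)
    # Keep voxels within the preset distance threshold, taken here as the
    # voxel diagonal length (step S2036).
    return dists <= voxel_size * np.sqrt(3.0)
```

The returned boolean mask marks the voxels that make up the initial occupancy grid; a real implementation would use a spatial index rather than the pairwise distance matrix shown here.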
Step S204, determining a first direction vector corresponding to each pixel according to the pixel value of each pixel, the configuration information and the pose information corresponding to the two-dimensional image to which each pixel belongs.
Specifically, the step S204 includes:
In step S2041, back projection transformation is performed on each pixel according to the pixel value and the configuration information of each pixel, to obtain a second direction vector of each pixel.
The second direction vector is a vector in the camera coordinate system.
Step S2042, performing coordinate transformation on the second direction vector of each pixel according to the pixel value and pose information of each pixel, to obtain the first direction vector.
Wherein the first direction vector is a vector in the world coordinate system.
In this way, through back projection transformation and coordinate transformation, a vector in the world coordinate system is obtained; that is, the second direction vectors are unified into the three-dimensional space of the initial three-dimensional occupancy grid, which facilitates the subsequent determination of the sampling interval.
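The back projection and coordinate transformation of steps S2041 and S2042 can be sketched as follows, assuming a pinhole intrinsics matrix K and a camera-to-world pose (R, t); the function name and the pose convention are assumptions for illustration:

```python
import numpy as np

def pixel_ray(u, v, K, R_c2w, t_c2w):
    # Back projection (S2041): homogeneous pixel -> camera-frame direction,
    # i.e. the second direction vector.
    pix = np.array([u, v, 1.0])
    dir_cam = np.linalg.inv(K) @ pix
    # Coordinate transformation (S2042): camera frame -> world frame.  Only
    # the rotation applies, because a direction vector is translation-free.
    dir_world = R_c2w @ dir_cam
    dir_world = dir_world / np.linalg.norm(dir_world)
    # The ray origin is the camera center in world coordinates.
    return t_c2w, dir_world
```

With an identity pose, the principal point back-projects onto the optical axis, which is a quick sanity check on the conventions used.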
In step S205, a sampling interval is determined according to the intersection points between the first direction vector of each pixel and the voxels in the initial three-dimensional occupancy grid.
Specifically, the step S205 includes:
In step S2051, when the first direction vector passes through the initial three-dimensional occupancy grid and produces two intersection points, the two intersection points are determined as the start voxel and the end voxel, respectively.
Specifically, for the first direction vector of each pixel, among the voxels of the initial three-dimensional occupancy grid that the vector intersects, the start voxel and the end voxel at which the vector enters and leaves the grid are determined.
Step S2052, determining a sampling interval according to the predicted depth value of the pixel corresponding to the start voxel and the predicted depth value of the pixel corresponding to the end voxel.
Specifically, the predicted depth value of the start voxel and the predicted depth value of the end voxel are obtained from the calculation result of step S202. The predicted depth value of the start voxel is taken as the left endpoint and the predicted depth value of the end voxel as the right endpoint, and together they form a unit sampling interval. In this way, a unit sampling interval is obtained for each first direction vector, and the unit sampling intervals together constitute the total sampling interval of the initial occupancy grid. For example, the predicted depth value of the start voxel may be denoted Near and that of the end voxel Far, so the sampling interval may be written as [Near, Far].
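Finding the entry and exit depths of a ray through the grid bounds can be sketched with the standard slab method against the grid's axis-aligned bounding box; this substitutes a box intersection for the patent's per-voxel test and is an illustrative simplification:

```python
import numpy as np

def ray_box_interval(origin, direction, box_min, box_max):
    # Slab method: intersect the ray with each pair of axis-aligned planes.
    # Assumes the direction has no exactly-zero components.
    inv = 1.0 / direction
    t0 = (box_min - origin) * inv
    t1 = (box_max - origin) * inv
    near = np.minimum(t0, t1).max()   # last entry across the three slabs
    far = np.maximum(t0, t1).min()    # first exit across the three slabs
    if near > far or far < 0:
        return None                   # the ray misses the grid
    return max(near, 0.0), far        # the unit sampling interval [Near, Far]
```

The returned pair plays the role of the start-voxel and end-voxel depths described above.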
Step S206, sampling the initial three-dimensional space-occupying grid in a sampling interval to obtain a sampling set.
According to the image sampling data acquisition method, because the initial three-dimensional occupancy grid contains a large number of voxels, using all of them as sampling points for subsequent network model training would require many training iterations, slow the convergence of the network model, and generate a large amount of intermediate and output data. Therefore, the predicted depth values at the nearest and farthest intersection points of the first direction vector with the initial three-dimensional occupancy grid determine a depth range strongly correlated with the two-dimensional image, which serves as the sampling interval. Sampling within this interval yields more accurate sampling data, and training the subsequent network model with less sampling data speeds up the whole training process and saves processing and storage resources.
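Step S206 only requires sampling inside [Near, Far]; one common choice, sketched here as an assumption rather than the patent's method, is the stratified scheme used by NERF-style renderers, which jitters one sample per equal sub-interval:

```python
import numpy as np

def stratified_samples(near, far, n_samples, rng=None):
    # Split [Near, Far] into n_samples equal bins and draw one uniformly
    # jittered sample per bin, so samples cover the interval evenly.
    rng = np.random.default_rng(rng)
    edges = np.linspace(near, far, n_samples + 1)
    lower, upper = edges[:-1], edges[1:]
    return lower + (upper - lower) * rng.random(n_samples)
```

Collecting these depths along every first direction vector produces the sampling set of step S206.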
In this embodiment, a method for training a NERF network model is also provided, and the training data used for training the NERF network model may be obtained by the above-mentioned image sampling data acquisition method. The training may be divided into two periods of iterative training: the specific processing of the first training period is shown in steps S301 to S303, and the specific processing of the second training period is shown in steps S304 to S306. The method can be used in a computer device, and FIG. 3 is a flow chart of a method for training NERF network models according to an embodiment of the invention. As shown in FIG. 3, the flow comprises the following steps:
Step S301, in a first training period, a set of samples is input into the NERF network model to be trained.
The NERF network model may be the classical NERF, Hash-NERF, Mip-NERF, or the like. The network framework of the NERF network model may be a Multi-Layer Perceptron (MLP).
Step S302, iterative training is carried out on the NERF network model to be trained.
Step S303, when the NERF network model is determined to meet the training stopping condition according to the estimated color value of the pixel corresponding to each voxel output by the NERF network model and the pre-acquired original color value, determining the NERF network model after training as a first NERF network model corresponding to the first training period.
The original color value is the original color value of the pixel corresponding to the estimated color value.
Specifically, the loss function may take the form of a 1-norm, a 2-norm, etc. when training the NERF network model. For each voxel, the computer device may calculate an absolute value of the difference between the pre-estimated color value and the pre-acquired original color value for the pixel corresponding to the voxel. And when the absolute value of the difference value between the estimated color value of the pixel corresponding to each voxel and the pre-acquired original color value tends to be stable, determining that the NERF network model meets the training stopping condition. Alternatively, the computer device may further sum the absolute value of the difference between the estimated color value of the pixel corresponding to each voxel and the pre-obtained original color value to obtain a loss value, and when the loss value tends to be stable, determine that the NERF network model meets the training stopping condition. Alternatively still, the computer device may stop training for the first training period when the number of training iterations reaches a specified number.
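The 1-norm loss and the "tends to be stable" stop test described above can be sketched as follows; the window size and tolerance are assumed values for illustration, not figures from this document:

```python
import numpy as np

def photometric_l1_loss(pred_rgb, true_rgb):
    # 1-norm form of the loss: sum of absolute differences between the
    # estimated color values and the pre-acquired original color values.
    return np.abs(pred_rgb - true_rgb).sum()

def loss_has_stabilised(history, window=5, tol=1e-4):
    # Illustrative convergence test: the loss is taken to be stable when
    # the spread of the last `window` loss values falls below `tol`.
    if len(history) < window:
        return False
    recent = history[-window:]
    return max(recent) - min(recent) < tol
```

A 2-norm variant would simply square the differences before summing; either form supports the same stability check.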
In the first training period, the NERF network model is trained by using the sampling set with less sampling data, so that the training process of the period can be quickened.
Step S304, in the second training period, the voxels of the ith three-dimensional occupancy grid are input into the ith NERF network model, the ith round of iterative training is carried out on the ith NERF network model, and the estimated color value of the pixel corresponding to each voxel and the weight value of each voxel output by the ith NERF network model are obtained.
Wherein i is a positive integer, and when i is 1, the i-th three-dimensional space grid is an initial three-dimensional space grid.
Step S305, according to the estimated color value of the pixel corresponding to each voxel and the original color value corresponding to each voxel output by the ith NERF network model, the parameters of the ith NERF network model are optimized to obtain the (i+1) th NERF network model.
And step S306, selecting the voxels whose weight value is greater than a preset threshold from the voxels output by the ith NERF network model to construct the (i+1)th three-dimensional occupancy grid; training stops when i+1 reaches a preset count threshold, and the (i+1)th three-dimensional occupancy grid is determined as the target three-dimensional occupancy grid.
Specifically, during the first round of training in the second training period (when i=1), the initial three-dimensional occupancy grid (i.e., the first three-dimensional occupancy grid) may be input into the first NERF network model, to obtain the estimated color value of the pixel corresponding to each voxel in the initial three-dimensional occupancy grid. The absolute value of the difference between the estimated color value and the pre-acquired original color value of each pixel is then calculated, and the loss value is calculated from the loss function and these absolute differences. The first NERF network model adjusts its parameters according to the loss value obtained in this first round, yielding the second NERF network model.
In addition, the first NERF network model may further calculate a weight value for each voxel in the initial three-dimensional occupancy grid according to Bayesian theory, and determine the voxels with weight values greater than a preset threshold as the second three-dimensional occupancy grid.
During the second round of training in the second training period (when i=2), the second three-dimensional occupancy grid is input into the second NERF network model to obtain the estimated color value of the pixel corresponding to each voxel in the second three-dimensional occupancy grid. The absolute value of the difference between the estimated color value and the pre-acquired original color value of each pixel is then calculated, and the loss value is calculated from the loss function and these absolute differences. The second NERF network model adjusts its parameters according to the loss value obtained in this second round, yielding the third NERF network model.
In addition, the second NERF network model may further calculate a weight value for each voxel in the second three-dimensional occupancy grid according to Bayesian theory, and determine the voxels with weight values greater than the preset threshold as the third three-dimensional occupancy grid.
Subsequent rounds of training proceed in the same way. Training stops when the number of training rounds reaches the preset count threshold, and the voxels retained after the last round are determined as the target three-dimensional occupancy grid, that is, the reconstructed three-dimensional model. Alternatively, training stops when the loss value determined from the estimated color values of the pixels corresponding to the voxels output by the (i+1)th NERF network model tends to be stable.
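The alternating train-then-prune structure of steps S304 to S306 can be sketched as follows, with the NERF model, the optimisation step, and the Bayesian weight computation abstracted into caller-supplied functions; all names are illustrative stand-ins, not APIs from this document:

```python
def second_training_period(model, grid, train_step, weight_fn,
                           prune_threshold, max_rounds):
    # Each round: optimise the current model on the current occupancy grid,
    # then keep only voxels whose weight exceeds the preset threshold to
    # form the next occupancy grid.
    for _ in range(max_rounds):
        model = train_step(model, grid)       # parameter tuning (S305)
        weights = weight_fn(model, grid)      # one weight per voxel (S304)
        grid = [v for v, w in zip(grid, weights)
                if w > prune_threshold]       # voxel screening (S306)
    return model, grid                        # grid = target occupancy grid
```

A stability-based stop condition could replace the fixed round count, matching the alternative described above.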
In the second training period, the first NERF network model, already trained in the first period, is further trained with all voxels in the initial three-dimensional occupancy grid, so that training proceeds efficiently and a more accurate NERF network model (i.e., the second NERF network model) is obtained.
In other possible implementations, in step S304, the weight value of each voxel may be recorded and counted; when the number of times that a voxel's weight value is less than or equal to the preset threshold reaches a specified count, the voxel is removed, and a new three-dimensional occupancy grid is constructed from the remaining voxels.
In this way, the three-dimensional occupancy grid does not need to be updated after every round of training, which saves processing resources. Moreover, if a voxel's weight value remains at or below the preset threshold across multiple rounds of training, the probability that the voxel is occupied is extremely low, so removing it allows the target three-dimensional occupancy grid to be determined more accurately.
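The counting variant, in which a voxel is removed only after its weight has stayed at or below the threshold a specified number of times, might be sketched as follows; the class and parameter names are assumptions:

```python
from collections import Counter

class VoxelPruner:
    def __init__(self, threshold, max_low_count):
        self.threshold = threshold        # preset weight threshold
        self.max_low_count = max_low_count  # specified number of low rounds
        self.low_counts = Counter()       # low-weight occurrences per voxel

    def update(self, voxels, weights):
        # Record every voxel whose weight is at or below the threshold
        # this round, then rebuild the grid without over-the-limit voxels.
        for v, w in zip(voxels, weights):
            if w <= self.threshold:
                self.low_counts[v] += 1
        return [v for v in voxels
                if self.low_counts[v] < self.max_low_count]
```

Here voxels are assumed hashable (e.g. index tuples); a dense boolean mask would serve equally well.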
Because the NERF network model has a long training cycle and is difficult to converge, the method for training the NERF network model provided by this embodiment adopts staged training. In the first training period, the NERF network model is initially trained using the data sampled within the accurate sampling interval. Because less data is used and the sampling interval is accurate, the NERF network model readily reaches convergence through the training of the first training period. In the second training period, the NERF network model is iteratively trained again using all the data of the initial three-dimensional occupancy grid. Although these training samples contain more data, a reasonably accurate NERF network model has already been obtained after the first training period, so the second training period reaches the convergence condition faster than directly training an untrained NERF network model on all the data of the initial three-dimensional occupancy grid. Furthermore, each round of the second training period uses an occupancy grid updated from the previous round, so updating the occupancy grid effectively accelerates the training process. In addition, because the second training period uses all the data of the initial three-dimensional occupancy grid, the characteristics of all the data are considered comprehensively, and an accurate NERF network model is obtained. Finally, using the NERF network model trained through the two training periods for three-dimensional reconstruction yields a more accurate three-dimensional model.
In this embodiment, a method for applying a NERF network model is provided, which may be used in a computer device, and fig. 4 is a flowchart of a method for applying a NERF network model according to an embodiment of the present invention, as shown in fig. 4, where the flowchart includes the following steps:
step S401, new pose information corresponding to the camera is acquired.
Step S402, inputting the new pose information and the target three-dimensional space occupying grid into a second NERF network model to obtain a two-dimensional image corresponding to the new pose information.
Specifically, the user can input new pose information of the camera, and after detecting an input operation instruction of the user, the computer equipment inputs the new pose information and the target three-dimensional occupation grid in the operation instruction into the second NERF network model. The second NERF network model can determine a two-dimensional image corresponding to the new pose information according to the new pose information and the target three-dimensional space occupying grid.
According to the method for applying the NERF network model, a camera is not required to be used for photographing again, and a two-dimensional image of the target object under a new view angle can be obtained.
The embodiment also provides an image sampling data acquiring device, which is used for implementing the above embodiment and the preferred implementation, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides an image sampling data acquisition apparatus, as shown in fig. 5, including:
an obtaining module 501, configured to obtain at least one frame of two-dimensional image corresponding to a target object, and configuration information and pose information of a camera when the two-dimensional image is captured;
the model processing module 502 is configured to input the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image; inputting the predicted depth value, the two-dimensional image, the configuration information and the pose information of each pixel into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupation grid, wherein the initial three-dimensional occupation grid consists of a plurality of voxels;
a determining module 503, configured to determine a first direction vector corresponding to each pixel according to a pixel value of each pixel, configuration information and pose information corresponding to a two-dimensional image to which each pixel belongs; determining a sampling interval according to the intersection point between the first direction vector of each pixel and the voxels in the initial three-dimensional space-occupying grid;
the sampling module 504 is configured to sample the initial three-dimensional space-occupying grid in a sampling interval to obtain a sampling set.
In an alternative embodiment, the model processing module 502 is specifically configured to:
Determining a three-dimensional space corresponding to the two-dimensional image according to the pose information;
discretizing the three-dimensional space to obtain voxels corresponding to the three-dimensional space;
carrying out projection transformation on each voxel to obtain a projection depth value of a corresponding pixel of each voxel in a two-dimensional space;
determining a plane formed by voxels corresponding to pixels with equal projection depth values and predicted depth values as an initial surface of a target object;
calculating the distance from each voxel to the initial surface;
and determining an initial three-dimensional space occupying grid according to the distance from each voxel to the initial surface and a preset distance threshold value, wherein a plurality of voxels forming the initial three-dimensional space occupying grid are partial voxels in voxels corresponding to the three-dimensional space.
In an alternative embodiment, the determining module 503 is specifically configured to:
performing back projection transformation on each pixel according to the pixel value and the configuration information of each pixel to obtain a second direction vector of each pixel, wherein the second direction vector is a vector in the camera coordinate system;
and performing coordinate transformation on the second direction vector of each pixel according to the pixel value and the pose information of each pixel to obtain the first direction vector, wherein the first direction vector is a vector in the world coordinate system.
In an alternative embodiment, the determining module 503 is specifically configured to:
when the first direction vector penetrates through the initial three-dimensional occupancy grid to produce two intersection points, determining the two intersection points as the start voxel and the end voxel, respectively;
and determining a sampling interval according to the predicted depth value of the pixel corresponding to the starting point voxel and the predicted depth value of the pixel corresponding to the ending point voxel.
In an alternative embodiment, the apparatus further comprises an input module 505 and a training module 506:
an input module 505, configured to input, in a first training period, a sampling set into a neural radiation field network model to be trained;
the training module 506 is configured to perform iterative training on a neural radiation field network model to be trained;
and the determining module 503 is configured to determine, when the neural radiation field network model meets the training stopping condition according to the estimated color value and the pre-acquired original color value of the pixel corresponding to each voxel output by the neural radiation field network model, the neural radiation field network model after training is determined to be the first neural radiation field network model corresponding to the first training period, where the original color value is the original color value of the pixel corresponding to the estimated color value.
In an alternative embodiment, the apparatus further comprises a tuning module 507 and a construction module 508:
the obtaining module 501 is configured to input a voxel corresponding to the ith three-dimensional space-occupying grid into the ith neural radiation field network model in a second training period, perform an ith iterative training on the ith neural radiation field network model, and obtain an estimated color value of a pixel corresponding to each voxel output by the ith neural radiation field network model, and a weight value corresponding to each voxel, where i is a positive integer, and when i is 1, the ith three-dimensional space-occupying grid is an initial three-dimensional space-occupying grid;
the tuning module 507 is configured to tune parameters of the ith neural radiation field network model according to the estimated color value of the pixel corresponding to each voxel and the original color value corresponding to each voxel output by the ith neural radiation field network model, and obtain an (i+1) th neural radiation field network model;
and the construction module 508 is configured to select the voxels with weight values greater than a preset threshold from the voxels output by the ith neural radiation field network model to construct the (i+1)th three-dimensional occupancy grid; training stops when i+1 reaches a preset count threshold, and the (i+1)th three-dimensional occupancy grid is determined as the target three-dimensional occupancy grid.
In an alternative embodiment, the obtaining module 501 is further configured to obtain new pose information corresponding to the camera;
the model processing module 502 is further configured to input the new pose information and the target three-dimensional space-occupying grid into the i+1th neural radiation field network model, so as to obtain a two-dimensional image corresponding to the new pose information, where i+1 is equal to a preset frequency threshold.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The image sampling data acquisition device in this embodiment is presented in the form of functional units, where a unit refers to an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or other devices that can provide the above-described functionality.
The embodiment of the invention also provides computer equipment, which is provided with the image sampling data acquisition device shown in the figure 5.
Referring to fig. 6, fig. 6 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 6, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 6.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform the methods shown in implementing the above embodiments.
The memory 20 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device further comprises input means 30 and output means 40. The processor 10, memory 20, input device 30, and output device 40 may be connected by a bus or other means, for example in fig. 6.
The input device 30 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the computer apparatus, such as a touch screen, keypad, mouse, trackball, or touchpad. The output device 40 may comprise a display device or the like. Such display devices include, but are not limited to, liquid crystal displays, light-emitting diode displays, and plasma displays. In some alternative implementations, the display device may be a touch screen.

Claims (10)

1. A method of acquiring image sample data, the method comprising:
acquiring at least one frame of two-dimensional image corresponding to a target object, and configuration information and pose information of a camera when the two-dimensional image is shot;
Inputting the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image;
inputting the predicted depth value, the two-dimensional image, the configuration information and the pose information of each pixel into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupation grid, wherein the initial three-dimensional occupation grid consists of a plurality of voxels;
determining a first direction vector corresponding to each pixel according to the pixel value of each pixel, the configuration information corresponding to the two-dimensional image to which each pixel belongs and the pose information;
determining a sampling interval according to the intersection point between the first direction vector of each pixel and the voxels in the initial three-dimensional occupation grid;
and sampling the initial three-dimensional space occupying grid in the sampling interval to obtain a sampling set.
2. The method according to claim 1, wherein inputting the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid comprises:
determining, according to the pose information, a three-dimensional space corresponding to the two-dimensional image;
discretizing the three-dimensional space to obtain the voxels corresponding to the three-dimensional space;
performing a projection transformation on each voxel to obtain a projected depth value of the pixel corresponding to that voxel in two-dimensional space;
determining the surface formed by the voxels whose projected depth values equal the predicted depth values of their corresponding pixels as an initial surface of the target object;
calculating the distance from each voxel to the initial surface;
and determining the initial three-dimensional occupancy grid according to the distance from each voxel to the initial surface and a preset distance threshold, wherein the plurality of voxels forming the initial three-dimensional occupancy grid are a subset of the voxels corresponding to the three-dimensional space.
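The screening of voxels against the predicted depth can be illustrated as follows. The pinhole intrinsics `K`, the identity camera pose, the function name and the use of the voxel's z-coordinate as its depth are all assumptions for the sketch; a real implementation would vectorize the loop.

```python
import numpy as np

def occupancy_from_depth(voxels, depth_map, K, dist_threshold):
    """Keep a voxel when its depth along the viewing axis differs from
    the predicted depth of its projected pixel by at most the preset
    threshold (simplified sketch; identity camera pose assumed)."""
    kept = []
    for v in voxels:
        p = K @ v                         # projective transformation
        u = int(round(p[0] / p[2]))       # pixel column
        r = int(round(p[1] / p[2]))       # pixel row
        if 0 <= u < depth_map.shape[1] and 0 <= r < depth_map.shape[0]:
            if abs(v[2] - depth_map[r, u]) <= dist_threshold:
                kept.append(v)
    return kept

K = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [0.0, 0.0, 1.0]])  # toy intrinsics
depth_map = np.full((4, 4), 2.0)          # predicted flat surface at z = 2
voxels = [np.array([0.0, 0.0, 2.0]), np.array([0.0, 0.0, 3.0])]
kept = occupancy_from_depth(voxels, depth_map, K, 0.2)
```

The voxel on the predicted surface survives; the one a full unit behind it is discarded, which is the distance-threshold screening the claim describes.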
3. The method according to claim 1 or 2, wherein determining the first direction vector for each pixel according to the pixel value of the pixel and the configuration information and pose information of the two-dimensional image to which the pixel belongs comprises:
performing a back-projection transformation on each pixel according to its pixel value and the configuration information to obtain a second direction vector of the pixel, wherein the second direction vector is a vector in the camera coordinate system;
and performing a coordinate transformation on the second direction vector of each pixel according to its pixel value and the pose information to obtain the first direction vector, wherein the first direction vector is a vector in the world coordinate system.
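The two transformations can be sketched under the assumption of a pinhole camera with intrinsics `K` and a camera-to-world rotation `R` taken from the pose; all names here are illustrative.

```python
import numpy as np

def pixel_direction(u, v, K, R):
    """Back-project pixel (u, v) through K to the camera-space
    direction (the "second direction vector"), then rotate it by the
    camera-to-world rotation R into world space (the "first direction
    vector"). Directions are translation-invariant, so the pose
    translation drops out."""
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    d_cam /= np.linalg.norm(d_cam)
    d_world = R @ d_cam
    return d_world / np.linalg.norm(d_world)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                               # identity pose, purely for illustration
d = pixel_direction(320.0, 240.0, K, R)     # principal point -> optical axis
```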
4. The method according to claim 1 or 2, wherein determining the sampling interval according to the intersection points between the first direction vector of each pixel and the voxels of the initial three-dimensional occupancy grid comprises:
when the first direction vector passes through the initial three-dimensional occupancy grid and produces two intersection points, determining the two intersection points as a start voxel and an end voxel, respectively;
and determining the sampling interval according to the predicted depth value of the pixel corresponding to the start voxel and the predicted depth value of the pixel corresponding to the end voxel.
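One common way to obtain the two intersection points is the slab method against the grid's bounding box. This sketch assumes an axis-aligned grid and a ray direction with no zero components; the claim itself does not prescribe a particular intersection algorithm.

```python
import numpy as np

def ray_grid_interval(origin, direction, grid_min, grid_max):
    """Slab-method intersection of a ray with the axis-aligned bounding
    box of the occupancy grid; returns (t_near, t_far) along the ray,
    or None when the ray misses the grid entirely."""
    inv_d = 1.0 / direction                 # assumes no zero components
    t0 = (grid_min - origin) * inv_d
    t1 = (grid_max - origin) * inv_d
    t_near = np.minimum(t0, t1).max()       # entry point (start-voxel side)
    t_far = np.maximum(t0, t1).min()        # exit point (end-voxel side)
    if t_near > t_far or t_far < 0:
        return None
    return max(t_near, 0.0), t_far

origin = np.array([-2.0, -2.0, -2.0])
direction = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)
interval = ray_grid_interval(origin, direction,
                             np.array([-1.0, -1.0, -1.0]),
                             np.array([1.0, 1.0, 1.0]))
```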
5. The method according to claim 1 or 2, wherein the method further comprises:
in a first training period, inputting the sampling set into a neural radiance field network model to be trained;
performing iterative training on the neural radiance field network model to be trained;
and when it is determined, from the estimated color value output by the neural radiance field network model for the pixel corresponding to each voxel and the pre-acquired original color value of that pixel, that the neural radiance field network model satisfies a training-stop condition, determining the trained neural radiance field network model as the first neural radiance field network model for the first training period.
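The claim leaves the exact training-stop condition open; one plausible choice, sketched below, is the mean squared error between estimated and pre-acquired original color values falling under a tolerance. Both the MSE criterion and the name `should_stop` are assumptions.

```python
import numpy as np

def should_stop(estimated, original, tol=1e-3):
    """Hypothetical stop condition: mean squared color error below a
    preset tolerance. Returns the decision together with the error so
    it can also be logged during training."""
    mse = float(np.mean((estimated - original) ** 2))
    return mse < tol, mse

stop, mse = should_stop(np.array([0.50, 0.50]), np.array([0.50, 0.51]))
```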
6. The method of claim 5, wherein the method further comprises:
in a second training period, inputting the voxels of an i-th three-dimensional occupancy grid into an i-th neural radiance field network model and performing an i-th round of iterative training on it, to obtain the estimated color value of the pixel corresponding to each voxel and a weight value of each voxel output by the i-th neural radiance field network model, wherein i is a positive integer, and when i is 1, the i-th three-dimensional occupancy grid is the initial three-dimensional occupancy grid;
optimizing the parameters of the i-th neural radiance field network model according to the estimated color value of the pixel corresponding to each voxel and the original color value corresponding to each voxel output by the i-th neural radiance field network model, to obtain an (i+1)-th neural radiance field network model;
and screening, from the voxels output by the i-th neural radiance field network model, the voxels whose weight value is greater than a preset threshold to construct an (i+1)-th three-dimensional occupancy grid, stopping training when i+1 reaches a preset count threshold, and determining the (i+1)-th three-dimensional occupancy grid as the target three-dimensional occupancy grid.
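The screening step of each round can be sketched in a few lines; the toy voxel centers and the name `prune_voxels` are illustrative only.

```python
import numpy as np

def prune_voxels(voxels, weights, weight_threshold):
    """Keep only the voxels whose weight output by the current
    radiance-field model exceeds the preset threshold; the survivors
    form the next round's occupancy grid."""
    return voxels[weights > weight_threshold]

voxels = np.arange(12, dtype=float).reshape(4, 3)   # 4 toy voxel centers
weights = np.array([0.9, 0.1, 0.5, 0.05])
next_grid = prune_voxels(voxels, weights, 0.2)      # rows 0 and 2 survive
```

Repeating this across rounds progressively concentrates the grid, and therefore the sampling of claim 1, on the voxels that actually contribute to the rendered image.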
7. The method of claim 6, wherein the method further comprises:
acquiring new pose information of the camera;
and inputting the new pose information and the target three-dimensional occupancy grid into the (i+1)-th neural radiance field network model to obtain a two-dimensional image corresponding to the new pose information, wherein i+1 equals the preset count threshold.
8. An image sampling data acquisition apparatus, the apparatus comprising:
an acquisition module, configured to acquire at least one frame of a two-dimensional image of a target object, together with configuration information and pose information of the camera at the time the two-dimensional image was captured;
a model processing module, configured to input the two-dimensional image, the configuration information and the pose information into a monocular depth prediction network model to obtain a predicted depth value of each pixel in the two-dimensional image, and to input the predicted depth value of each pixel, the two-dimensional image, the configuration information and the pose information into a three-dimensional reconstruction network model to obtain an initial three-dimensional occupancy grid, wherein the initial three-dimensional occupancy grid consists of a plurality of voxels;
a determining module, configured to determine a first direction vector for each pixel according to the pixel value of the pixel and the configuration information and pose information of the two-dimensional image to which the pixel belongs, and to determine a sampling interval according to the intersection points between the first direction vector of each pixel and the voxels of the initial three-dimensional occupancy grid;
and a sampling module, configured to sample the initial three-dimensional occupancy grid within the sampling interval to obtain a sampling set.
9. A computer device, comprising:
a memory and a processor communicatively coupled to each other, wherein the memory stores computer instructions and the processor executes the computer instructions to perform the image sampling data acquisition method of any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the image sampling data acquisition method of any one of claims 1 to 7.
CN202311598977.3A 2023-11-28 2023-11-28 Image sampling data acquisition method, device, computer equipment and storage medium Active CN117333626B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311598977.3A CN117333626B (en) 2023-11-28 2023-11-28 Image sampling data acquisition method, device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117333626A true CN117333626A (en) 2024-01-02
CN117333626B CN117333626B (en) 2024-04-26

Family

ID=89283354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311598977.3A Active CN117333626B (en) 2023-11-28 2023-11-28 Image sampling data acquisition method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117333626B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11170470B1 (en) * 2018-12-06 2021-11-09 Facebook, Inc. Content-adaptive non-uniform image downsampling using predictive auxiliary convolutional neural network
CN110874864A (en) * 2019-10-25 2020-03-10 深圳奥比中光科技有限公司 Method, device, electronic equipment and system for obtaining three-dimensional model of object
WO2021077720A1 (en) * 2019-10-25 2021-04-29 深圳奥比中光科技有限公司 Method, apparatus, and system for acquiring three-dimensional model of object, and electronic device
WO2021096806A1 (en) * 2019-11-14 2021-05-20 Zoox, Inc Depth data model training with upsampling, losses, and loss balancing
US20220130156A1 (en) * 2019-12-13 2022-04-28 Shenzhen Sensetime Technology Co., Ltd. Three-dimensional object detection and intelligent driving
CN111857160A (en) * 2020-08-17 2020-10-30 武汉中海庭数据技术有限公司 Unmanned vehicle path planning method and device
CN113487739A (en) * 2021-05-19 2021-10-08 清华大学 Three-dimensional reconstruction method and device, electronic equipment and storage medium
CN114627112A (en) * 2022-05-12 2022-06-14 宁波博登智能科技有限公司 Semi-supervised three-dimensional target labeling system and method
CN116563493A (en) * 2023-05-10 2023-08-08 北京达佳互联信息技术有限公司 Model training method based on three-dimensional reconstruction, three-dimensional reconstruction method and device
CN116993817A (en) * 2023-09-26 2023-11-03 深圳魔视智能科技有限公司 Pose determining method and device of target vehicle, computer equipment and storage medium

Also Published As

Publication number Publication date
CN117333626B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
US11798132B2 (en) Image inpainting method and apparatus, computer device, and storage medium
CN108335353B (en) Three-dimensional reconstruction method, device and system of dynamic scene, server and medium
TWI747120B (en) Method, device and electronic equipment for depth model training and storage medium thereof
CN109544598B (en) Target tracking method and device and readable storage medium
WO2022052782A1 (en) Image processing method and related device
CN116189172B (en) 3D target detection method, device, storage medium and chip
JP2023533907A (en) Image processing using self-attention-based neural networks
CN114758337B (en) Semantic instance reconstruction method, device, equipment and medium
JP2024507727A (en) Rendering a new image of a scene using a geometric shape recognition neural network conditioned on latent variables
CN111626134A (en) Dense crowd counting method, system and terminal based on hidden density distribution
CN111444923A (en) Image semantic segmentation method and device under natural scene
CN113759338B (en) Target detection method and device, electronic equipment and storage medium
CN117333626B (en) Image sampling data acquisition method, device, computer equipment and storage medium
CN115620264B (en) Vehicle positioning method and device, electronic equipment and computer readable medium
CN112150529B (en) Depth information determination method and device for image feature points
CN116664694A (en) Training method of image brightness acquisition model, image acquisition method and mobile terminal
WO2022127576A1 (en) Site model updating method and system
EP4330932A1 (en) Texture completion
CN112288817A (en) Three-dimensional reconstruction processing method and device based on image
CN115049895B (en) Image attribute identification method, attribute identification model training method and device
CN117292349B (en) Method, device, computer equipment and storage medium for determining road surface height
CN112330815B (en) Three-dimensional point cloud data processing method, device and equipment based on obstacle fusion
CN117372286B (en) Python-based image noise optimization method and system
CN116168132B (en) Street view reconstruction model acquisition method, device, equipment and medium
US20230360327A1 (en) Generating three-dimensional representations for digital objects utilizing mesh-based thin volumes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant