CN112652059B - Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method

Info

Publication number
CN112652059B
CN112652059B (application CN202011642349.7A)
Authority
CN
China
Prior art keywords: dimensional, voxel, target object, target detection, model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011642349.7A
Other languages
Chinese (zh)
Other versions
CN112652059A (en)
Inventors: 刘嵩, 周梓涵, 来庆涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qilu University of Technology
Original Assignee
Qilu University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qilu University of Technology filed Critical Qilu University of Technology
Priority to CN202011642349.7A
Publication of CN112652059A
Application granted
Publication of CN112652059B

Classifications

    • G06T 17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 20/647: Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G06V 2201/07: Target detection

Abstract

The disclosure provides a Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method. The method comprises: obtaining an image containing the target to be identified; preprocessing the image with a generative adversarial network; performing two-dimensional target detection on the original and preprocessed images with a GA-RPN network model to obtain the position, anchor box, and classification of the target object; performing voxel conversion using the preprocessed image together with the obtained position, anchor-box, and classification data to obtain three-dimensional voxel information of the target object; and refining the three-dimensional voxel information to obtain the final 3D model of the target object. The method enables faster and more efficient target detection and construction of a 3D model of the target in an image.

Description

Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method
Technical Field
The disclosure relates to the technical field of image processing, and in particular to an improved target detection and three-dimensional reconstruction method based on the Mesh R-CNN (mesh region-based convolutional neural network) model.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In recent years, deep learning has greatly advanced the field that combines 2D object detection with 3D reconstruction: neural networks are used to learn representations such as three-dimensional voxels, point clouds, and meshes, promoting the understanding and development of the three-dimensional world.
2D object detection, represented by Faster R-CNN and Mask R-CNN, has been studied to good effect and applied in many fields. Building on this progress, combined 2D object detection and 3D reconstruction is more accurate and intuitive: 3D detection finds the bounding box of the target object in a picture and generates a three-dimensional model of that object, and compared with 2D detection alone, combining 2D detection with 3D reconstruction extracts the information in the picture more comprehensively.
The inventors of the present disclosure find that, although two-dimensional object detection has developed rapidly, and 2D detection represented by Faster R-CNN and Mask R-CNN has achieved good results and been applied in many fields, a purely two-dimensional detection task ignores the three-dimensional information of the object and cannot extract the 3D information of the object to be detected in the picture.
Disclosure of Invention
To overcome the defects of the prior art, the present disclosure provides a Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method that realizes target detection in an image and construction of the target's 3D model more efficiently.
To this end, the present disclosure adopts the following technical scheme:
the first aspect of the disclosure provides a Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method.
A Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method comprises the following steps:
acquiring an image including an object to be recognized;
preprocessing the image by utilizing a generated countermeasure network;
performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain an anchor frame area of a target object;
carrying out voxel conversion on the anchor-box region generated by target detection in combination with the Pix2Vox method to obtain three-dimensional voxel information of the target object;
and refining the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object.
The second aspect of the disclosure provides a Mesh R-CNN model-based improved target detection and three-dimensional reconstruction system.
A Mesh R-CNN model-based improved target detection and three-dimensional reconstruction system comprises:
a data acquisition module configured to: acquiring an image including an object to be recognized;
an image processing module configured to: preprocessing the image by utilizing a generated countermeasure network;
a two-dimensional object recognition module configured to: performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain an anchor frame area of a target object;
a voxel conversion module configured to: carrying out voxel conversion by using an anchor frame region generated by target detection and combining a Pix2Vox method to obtain three-dimensional voxel information of a target object;
a 3D conversion module configured to: refine the obtained three-dimensional voxel information of the target object using the PNA method to obtain the final 3D model of the target object.
A third aspect of the present disclosure provides a medium, on which a program is stored, which when executed by a processor, implements the steps in the Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method according to the first aspect of the present disclosure.
A fourth aspect of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and executable on the processor, where the processor implements the steps in the Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method according to the first aspect of the present disclosure when executing the program.
Compared with the prior art, the beneficial effect of this disclosure is:
1. According to the Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method of the disclosure, processing the original image with a generative adversarial network makes the model more robust.
2. According to the method of the disclosure, GA-RPN replaces Mask R-CNN in Mesh R-CNN for target detection, and the anchor guiding-and-positioning method processes the picture to determine the position and anchor box of the target object, improving the accuracy of target recognition.
3. According to the method of the disclosure, the Pix2Vox method performs the voxel conversion, obtaining more accurate three-dimensional voxel information of the target object.
4. According to the method of the disclosure, PNA replaces the GCN in Mesh R-CNN to refine the three-dimensional volume; since PNA adopts a multiple-aggregator scheme, it can extract voxel information better.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
Fig. 1 is a schematic flow chart of a target detection and three-dimensional reconstruction method provided in embodiment 1 of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example 1:
As shown in fig. 1, embodiment 1 of the present disclosure provides a more efficient and accurate 2D target detection and 3D reconstruction framework (AFM R-CNN) based on Mesh R-CNN, which specifically comprises:
processing the image with a generative adversarial network to improve the robustness of the model;
running the picture generated by the adversarial network in parallel with the original picture, and performing two-dimensional target detection with GA-RPN to obtain the two-dimensional position of the target object;
inputting the anchor-box region obtained by target detection into Pix2Vox, converting the two-dimensional features of the image into three-dimensional voxels, and fusing them to generate initial coarse voxels;
inputting the obtained initial coarse voxels into a cubify layer, converting the three-dimensional voxels into a three-dimensional mesh, and optimizing the mesh with an accurate mesh prediction branch;
and refining the optimized mesh through two PNA refinement layers to obtain the final 3D model (a minimal structural sketch of this pipeline follows).
More specifically, the method comprises the following steps:
s1: generating a countermeasure network
In order to enhance the robustness of the model, the embodiment uses a generation countermeasure network (GAN) to process the original picture to generate a false picture, and the GAN is a generation model using back propagation and can generate a false sample close to a real sample.
Two models are contained in the GAN model framework: generating a Model (generic Model) and a discriminant Model (discriminant Model), wherein the generating Model is used for generating a picture, the discriminant Model is used for determining whether the picture is true or false, the two models play a game with each other to generate an output picture, and the calculation formula is as follows:
$$\min_G \max_D V(D,G) = \mathbb{E}_{x_{img} \sim P_{data}(x_{img})}[\log D(x_{img})] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))] \quad (1)$$

where $x_{img}$ represents an input picture and $P_{data}(x_{img})$ represents the distribution of the input pictures; $\mathbb{E}_{x_{img} \sim P_{data}(x_{img})}[\log D(x_{img})]$ is the discriminator term and $\mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$ is the generator term, with $z$ a noise vector drawn from the prior $P_z(z)$.
Through this mutual game, the data generated by the generator G becomes ever closer to the real data while the discriminating ability of the discriminator D is maximized; iterating this process continuously improves the modeling capability of the generator and the judgment capability of the discriminator, finally yielding the output picture.
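As an illustration of the adversarial game in equation (1), the following toy sketch trains a generator and discriminator with the standard binary cross-entropy formulation; the tiny fully connected architectures, the 784-dimensional flattened images, and the learning rates are assumptions made for brevity, not the networks of this disclosure:

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def gan_step(real):
    """One iteration of the minimax game on a batch of real images."""
    b = real.size(0)
    fake = G(torch.randn(b, 100))
    # Discriminator: maximize log D(x) + log(1 - D(G(z)))
    loss_d = bce(D(real), torch.ones(b, 1)) + \
             bce(D(fake.detach()), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: fool D (non-saturating form of minimizing log(1 - D(G(z))))
    loss_g = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

d_loss, g_loss = gan_step(torch.randn(16, 784))   # toy "real" batch
```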
S2: object detection module
And replacing Mask R-CNN in the Mesh R-CNN with GA-RPN to detect the target, processing the picture by using an anchor frame guiding and positioning method, and determining the position and the anchor frame of the target object.
The GA-RPN generates a probability map indicating the position of the target object according to the feature map by the position prediction branch, then outputs the position information of the anchor frame, then the shape prediction branch generates a related anchor frame shape according to the generated position information, finally determines the most probable shape at each position by setting a threshold value, and generates a group of anchor frames by combining the position information.
The probability distribution of the position and shape is calculated as follows:
$$p(x, y, w, h \mid I) = p(x, y \mid I)\, p(w, h \mid x, y, I) \quad (2)$$

where I is the input image, (x, y) denotes the position of the anchor box, and w and h denote the width and height of the anchor box.
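A hedged sketch of this two-branch prediction follows; the 1x1 convolution heads, the 0.5 probability threshold, and the stride-based decoding of (w, h) are illustrative assumptions rather than the exact GA-RPN design:

```python
import torch
import torch.nn as nn

class GuidedAnchorHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.loc = nn.Conv2d(in_ch, 1, 1)    # location branch: p(x, y | I)
        self.shape = nn.Conv2d(in_ch, 2, 1)  # shape branch: (w, h) per cell

    def forward(self, feat, thresh=0.5, stride=16.0):
        prob = torch.sigmoid(self.loc(feat))[0, 0]    # (H, W) probability map
        wh = stride * torch.exp(self.shape(feat))[0]  # (2, H, W) anchor sizes
        ys, xs = torch.nonzero(prob > thresh, as_tuple=True)
        # one (x, y, w, h) anchor per cell whose probability passes the threshold
        return torch.stack(
            [xs * stride, ys * stride, wh[0, ys, xs], wh[1, ys, xs]], dim=1)

anchors = GuidedAnchorHead()(torch.randn(1, 256, 32, 32))  # (K, 4) anchors
```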
S3: voxel transform fusion
Voxel conversion turns the two-dimensional information of a picture into three-dimensional information. Mesh R-CNN performs this conversion with a single voxel conversion layer, which cannot make full use of the two-dimensional information.
This embodiment uses the Pix2Vox method for the conversion to 3D: an encoder generates feature maps from the input images; a decoder takes each feature map as input and generates a corresponding coarse three-dimensional volume; finally, context-aware fusion is applied to the decoded results, where the fusion module adaptively selects, for each part, the highest-quality result among the coarse three-dimensional voxels and outputs the final fused three-dimensional volume (a compact sketch of the encoder-decoder stage follows).
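The sketch below illustrates the encoder-decoder stage under stated assumptions: a small convolutional stack stands in for the paper-style backbone, and a 3D transposed-convolution decoder upsamples each per-view feature vector into a 32x32x32 occupancy volume; the channel counts and image size are illustrative only:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 128 -> 64
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
    nn.AdaptiveAvgPool2d(1), nn.Flatten())                  # -> (B, 128)

decoder = nn.Sequential(
    nn.Unflatten(1, (128, 1, 1, 1)),
    nn.ConvTranspose3d(128, 64, 4, stride=1), nn.ReLU(),            # 1 -> 4
    nn.ConvTranspose3d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 4 -> 8
    nn.ConvTranspose3d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
    nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1), nn.Sigmoid())# 16 -> 32

views = torch.randn(3, 3, 128, 128)      # one cropped input per view
coarse = decoder(encoder(views))          # (3, 1, 32, 32, 32) coarse voxels
```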
The coarse voxels generated by the decoder enter the context-aware fusion module, which generates context information for each coarse voxel volume and fuses the obtained coarse voxel information. For each voxel volume the fusion module produces a score map, calculated as:

$$s^{q}_{(n,m,k)} = \frac{\exp\left(x^{q}_{(n,m,k)}\right)}{\sum_{r=1}^{f} \exp\left(x^{r}_{(n,m,k)}\right)} \quad (3)$$

where f represents the number of views, $s^{q}_{(n,m,k)}$ is the score at point (n, m, k) of the q-th coarse voxel volume, and $x^{r}_{(n,m,k)}$ is the raw rating generated by the context scoring network for the r-th coarse voxel volume.

The score maps are then used for a weighted summation that fuses the coarse volumes into a single voxel volume $V_z$, retaining the spatial information of the voxels to the greatest extent:

$$V_z = \sum_{q=1}^{f} s^{q} \odot V^{q} \quad (4)$$
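A minimal sketch of this fusion step follows; a single 3D convolution stands in for the context scoring network, which is an assumption made for brevity:

```python
import torch
import torch.nn as nn

score_net = nn.Conv3d(1, 1, kernel_size=3, padding=1)  # stand-in scorer

def fuse(coarse):
    """coarse: (f, 1, N, N, N) voxel volumes, one per view."""
    raw = score_net(coarse)          # x^q: raw score map per view
    s = torch.softmax(raw, dim=0)    # Eq. (3): normalize over the f views
    return (s * coarse).sum(dim=0)   # Eq. (4): V_z = sum_q s^q * V^q

fused = fuse(torch.rand(3, 1, 32, 32, 32))   # -> (1, 32, 32, 32)
```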
S4: Generating the final 3D model
PNA is used instead of the GCN in Mesh R-CNN to refine the three-dimensional volume; PNA adopts a multiple-aggregator scheme and can extract voxel information better. Because PNA cannot refine voxels directly, the generated voxels are first converted into a three-dimensional mesh using the cubify method (a sketch of this conversion follows).
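For illustration, here is a pure-Python sketch of the cubify idea: each occupied voxel contributes the faces of a unit cube, faces shared by two occupied voxels are dropped, and shared corner vertices are merged; the 0.5 occupancy threshold and the use of quad (rather than triangle) faces are simplifying assumptions:

```python
import numpy as np

# For each of the six axis-aligned neighbor offsets, the four corners
# (relative to the voxel origin) of the face bordering that neighbor.
_FACES = {
    (-1, 0, 0): [(0, 0, 0), (0, 1, 0), (0, 1, 1), (0, 0, 1)],
    (1, 0, 0):  [(1, 0, 0), (1, 0, 1), (1, 1, 1), (1, 1, 0)],
    (0, -1, 0): [(0, 0, 0), (0, 0, 1), (1, 0, 1), (1, 0, 0)],
    (0, 1, 0):  [(0, 1, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)],
    (0, 0, -1): [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)],
    (0, 0, 1):  [(0, 0, 1), (0, 1, 1), (1, 1, 1), (1, 0, 1)],
}

def cubify(vox, thresh=0.5):
    """Convert an occupancy grid into surface vertices and quad faces."""
    occ = vox > thresh
    verts, index, faces = [], {}, []
    for x, y, z in zip(*np.nonzero(occ)):
        for (dx, dy, dz), corners in _FACES.items():
            nx, ny, nz = x + dx, y + dy, z + dz
            inside = (0 <= nx < occ.shape[0] and 0 <= ny < occ.shape[1]
                      and 0 <= nz < occ.shape[2])
            if inside and occ[nx, ny, nz]:
                continue                      # internal face: skip it
            quad = []
            for cx, cy, cz in corners:
                v = (x + cx, y + cy, z + cz)
                if v not in index:            # merge shared vertices
                    index[v] = len(verts)
                    verts.append(v)
                quad.append(index[v])
            faces.append(quad)
    return np.array(verts, float), faces

vox = np.zeros((3, 3, 3)); vox[1, 1, 1] = 1.0
v, f = cubify(vox)            # a single cube: 8 vertices, 6 quad faces
```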
The final 3D model is obtained by refining the mesh through two PNA refinement layers, as sketched below. PNA (Principal Neighbourhood Aggregation) combines multiple aggregators with degree-based scalers that can amplify or attenuate a signal according to the degree of each node; this lets each node better understand the distribution of the messages it receives and effectively improves GNN performance.
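The following simplified layer illustrates the multiple-aggregator idea on mesh-vertex features; it assumes a dense adjacency matrix with at least one neighbor per vertex (true on a mesh) and, for brevity, omits PNA's degree-based scalers and towers:

```python
import torch
import torch.nn as nn

class PNALiteLayer(nn.Module):
    """Concatenate self features with mean/max/min neighbor aggregations,
    then apply a shared linear update."""
    def __init__(self, dim):
        super().__init__()
        self.update = nn.Linear(4 * dim, dim)

    def forward(self, x, adj):
        # x: (V, D) vertex features; adj: (V, V) 0/1 adjacency, no self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
        mean = adj @ x / deg                                   # mean aggregator
        nbr = x.unsqueeze(0).expand(x.size(0), -1, -1)         # (V, V, D)
        mask = adj.unsqueeze(-1) == 0
        mx = nbr.masked_fill(mask, float("-inf")).max(dim=1).values  # max agg.
        mn = nbr.masked_fill(mask, float("inf")).min(dim=1).values   # min agg.
        return torch.relu(self.update(torch.cat([x, mean, mx, mn], dim=-1)))

verts = torch.randn(8, 16)                 # 8 vertices, 16-d features
adj = torch.ones(8, 8).fill_diagonal_(0)   # toy fully connected "mesh"
refined = PNALiteLayer(16)(verts, adj)     # (8, 16) refined features
```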
Here, model losses need to be computed to optimize the three-dimensional reconstruction model. Because calculating a loss function directly on a three-dimensional mesh is very difficult, the surface of the mesh is densely sampled into point clouds and the point-cloud losses are taken as the shape losses: given two point cloud sets F and H with normal vectors, the normal distance and the chamfer distance between F and H are used as the point-cloud losses.
The normal distance between F and H is:

$$\mathcal{L}_{norm}(F,H) = -\,|F|^{-1} \sum_{(f,h) \in \Lambda_{F,H}} \left| \mu_f \cdot \mu_h \right| \;-\; |H|^{-1} \sum_{(h,f) \in \Lambda_{H,F}} \left| \mu_h \cdot \mu_f \right| \quad (5)$$

The chamfer distance is:

$$\mathcal{L}_{cham}(F,H) = |F|^{-1} \sum_{(f,h) \in \Lambda_{F,H}} \lVert f - h \rVert^2 \;+\; |H|^{-1} \sum_{(h,f) \in \Lambda_{H,F}} \lVert h - f \rVert^2 \quad (6)$$

where $\Lambda_{F,H} = \{(f, \arg\min_h \lVert f - h \rVert) : f \in F\}$ is the set of pairs (f, h) in which h is the point of cloud H nearest to f, and $\mu_f$ denotes the normal vector at point f.
Optimizing with the point-cloud losses alone can yield degenerate meshes, so an edge loss is added to improve the quality of the mesh prediction:

$$\mathcal{L}_{edge}(V,E) = \frac{1}{\lvert E \rvert} \sum_{(v, v') \in E} \lVert v - v' \rVert^2 \quad (7)$$

where $E \subseteq V \times V$ represents the edges of the predicted mesh and v represents its vertices.
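To make the losses concrete, here is a brute-force sketch of equations (5) through (7); the O(|F| x |H|) pairwise-distance computation is chosen for clarity over the efficient implementations used in practice:

```python
import torch

def chamfer_and_normal(pf, nf, ph, nh):
    """pf, ph: (Nf, 3), (Nh, 3) sampled points; nf, nh: matching unit normals."""
    d = torch.cdist(pf, ph)               # (Nf, Nh) pairwise distances
    f2h = d.min(dim=1)                    # nearest h for each f
    h2f = d.min(dim=0)                    # nearest f for each h
    cham = f2h.values.pow(2).mean() + h2f.values.pow(2).mean()       # Eq. (6)
    norm = -(nf * nh[f2h.indices]).sum(-1).abs().mean() \
           - (nh * nf[h2f.indices]).sum(-1).abs().mean()             # Eq. (5)
    return cham, norm

def edge_loss(verts, edges):
    """verts: (V, 3); edges: (E, 2) vertex index pairs of the predicted mesh."""
    v0, v1 = verts[edges[:, 0]], verts[edges[:, 1]]
    return (v0 - v1).pow(2).sum(-1).mean()                           # Eq. (7)
```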
Example 2:
an embodiment 2 of the present disclosure provides a target detection and three-dimensional reconstruction system, including:
a data acquisition module configured to: acquiring an image including an object to be recognized;
an image processing module configured to: preprocessing the image by utilizing a generated countermeasure network;
a two-dimensional object recognition module configured to: performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain the position, the anchor frame and the classification of a target object;
a voxel conversion module configured to: carrying out voxel conversion by utilizing the preprocessed image and the obtained position, anchor frame and classification data of the target object to obtain three-dimensional voxel information of the target object;
a 3D model generation module configured to: refine the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object.
The working method of the system is the same as the target detection and three-dimensional reconstruction method provided in embodiment 1, and details are not repeated here.
Example 3:
the embodiment 3 of the present disclosure provides a medium, on which a program is stored, and when the program is executed by a processor, the method implements the steps of the target detection and three-dimensional reconstruction method according to the embodiment 1 of the present disclosure, where the steps are:
acquiring an image including an object to be recognized;
preprocessing the image by utilizing a generated countermeasure network;
performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain the position, the anchor frame and the classification of a target object;
carrying out voxel conversion on the anchor-box region generated by target detection in combination with the Pix2Vox method to obtain three-dimensional voxel information of the target object;
and refining the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object.
The detailed steps are the same as those of the target detection and three-dimensional reconstruction method provided in embodiment 1, and are not described herein again.
Example 4:
an embodiment 4 of the present disclosure provides an electronic device, including a memory, a processor, and a program stored in the memory and capable of running on the processor, where the processor implements steps in the target detection and three-dimensional reconstruction method according to embodiment 1 of the present disclosure when executing the program, where the steps are:
acquiring an image including an object to be recognized;
preprocessing the image by utilizing a generated countermeasure network;
performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain the position, the anchor frame and the classification of a target object;
carrying out voxel conversion on the anchor-box region generated by target detection in combination with the Pix2Vox method to obtain three-dimensional voxel information of the target object;
and refining the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object.
The detailed steps are the same as those of the target detection and three-dimensional reconstruction method provided in embodiment 1, and are not described herein again.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (5)

1. A Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method, characterized by comprising the following steps:
acquiring an image including an object to be recognized;
preprocessing the image by utilizing a generated countermeasure network;
performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain an anchor frame area of a target object;
carrying out voxel conversion by using an anchor frame region generated by target detection and combining a Pix2Vox method to obtain three-dimensional voxel information of a target object;
refining the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object;
the GA-RPN generates, by its location prediction branch and from the feature map, a probability map indicating the position of the target object, and outputs the position information of the anchor boxes;
the shape prediction branch generates the corresponding anchor-box shape from the generated position information, determines the most probable shape at each position by a set threshold, and generates a set of anchor boxes by combining the position information;
performing two-dimensional target detection on the original image and the preprocessed image by using the GA-RPN network model also obtains the position and classification information of the target object;
an encoder generates feature maps from the input preprocessed image, and a decoder takes each feature map as input to generate a corresponding coarse three-dimensional volume;
context fusion is performed on the generated decoding results, the fusion module adaptively selecting the highest-quality result for each part from the coarse voxels and outputting the final fused three-dimensional volume;
the coarse voxels generated by the decoder enter the fusion module, which generates context information for each coarse voxel and fuses the obtained coarse voxel information, wherein the fusion module:
generates a score map for each voxel volume;
then weights and sums the score maps, fusing them into a single voxel volume;
first, the generated voxels are converted into a three-dimensional mesh by the cubify method, and then the three-dimensional volume is refined by PNA.
2. The Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method of claim 1, wherein:
the surface of the three-dimensional mesh is densely sampled into point clouds, the point-cloud loss is taken as the shape loss, and an edge loss is added to improve the mesh prediction quality and optimize the three-dimensional reconstruction model.
3. A Mesh R-CNN model-based improved target detection and three-dimensional reconstruction system is characterized in that: the method comprises the following steps:
a data acquisition module configured to: acquiring an image including an object to be recognized;
an image processing module configured to: preprocessing the image by utilizing a generated countermeasure network;
a two-dimensional object recognition module configured to: performing two-dimensional target detection on the original image and the preprocessed image by using a GA-RPN network model to obtain an anchor frame area of a target object;
a voxel conversion module configured to: carrying out voxel conversion by using an anchor frame region generated by target detection and combining a Pix2Vox method to obtain three-dimensional voxel information of a target object;
a 3D model generation module configured to: refine the obtained three-dimensional voxel information of the target object to obtain the final 3D model of the target object;
the GA-RPN generates, by its location prediction branch and from the feature map, a probability map indicating the position of the target object, and outputs the position information of the anchor boxes;
the shape prediction branch generates the corresponding anchor-box shape from the generated position information, determines the most probable shape at each position by a set threshold, and generates a set of anchor boxes by combining the position information;
performing two-dimensional target detection on the original image and the preprocessed image by using the GA-RPN network model also obtains the position and classification information of the target object;
an encoder generates feature maps from the input preprocessed image, and a decoder takes each feature map as input to generate a corresponding coarse three-dimensional volume;
context fusion is performed on the generated decoding results, the fusion module adaptively selecting the highest-quality result for each part from the coarse voxels and outputting the final fused three-dimensional volume;
the coarse voxels generated by the decoder enter the fusion module, which generates context information for each coarse voxel and fuses the obtained coarse voxel information, wherein the fusion module:
generates a score map for each voxel volume;
then weights and sums the score maps, fusing them into a single voxel volume;
first, the generated voxels are converted into a three-dimensional mesh using the cubify method, and then the three-dimensional volume is refined using PNA.
4. A medium having a program stored thereon, wherein the program, when executed by a processor, implements the steps of the Mesh R-CNN model based improved object detection and three-dimensional reconstruction method of any one of claims 1-2.
5. An electronic device comprising a memory, a processor and a program stored on the memory and executable on the processor, wherein the processor implements the steps of the Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method according to any one of claims 1-2 when executing the program.
CN112652059B (en): Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method; application CN202011642349.7A; priority date 2020-12-31; filing date 2020-12-31; status Active

Priority Applications (1)

Application Number: CN202011642349.7A; Publication: CN112652059B (en); Title: Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method

Publications (2)

Publication Number / Publication Date
CN112652059A (en) / 2021-04-13
CN112652059B (en) / 2022-06-14

Family

ID=75367057

Family Applications (1)

Application Number: CN202011642349.7A; Status: Active; Publication: CN112652059B (en); Priority Date: 2020-12-31; Filing Date: 2020-12-31; Title: Mesh R-CNN model-based improved target detection and three-dimensional reconstruction method

Country Status (1)

Country Link
CN (1) CN112652059B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111027547A (en) * 2019-12-06 2020-04-17 南京大学 Automatic detection method for multi-scale polymorphic target in two-dimensional image
CN112052860A (en) * 2020-09-11 2020-12-08 中国人民解放军国防科技大学 Three-dimensional target detection method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
3D Model Inpainting Based on 3D Deep Convolutional Generative Ad; Xinying Wang et al.; IEEE Access; 2020-09-15; entire document *
Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view Images; Xie, HZ et al.; 2019 IEEE/CVF International Conference on Computer Vision (ICCV); 2019-11-02; entire document *
Research on image three-dimensional reconstruction technology based on convolutional neural networks (基于卷积神经网络的图像三维重构技术研究); Wan Xiaoxiao (万潇潇); China Master's Theses Full-text Database, Information Science and Technology (中国优秀博硕士学位论文全文数据库(硕士)信息科技辑); 2019-12-15; entire document *
A survey of deep-learning-based methods for analyzing and understanding 3D data (基于深度学习的三维数据分析理解方法研究综述); Li Haisheng (李海生) et al.; Chinese Journal of Computers (计算机学报); 2020-01-15; entire document *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant