CN109934847B - Method and device for estimating posture of weak texture three-dimensional object


Info

Publication number
CN109934847B
Authority
CN
China
Prior art keywords: model, loss, dimensional object, obj, posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910168783.7A
Other languages
Chinese (zh)
Other versions
CN109934847A (en)
Inventor
刘万凯 (Liu Wankai)
刘力 (Liu Li)
李中源 (Li Zhongyuan)
张小军 (Zhang Xiaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shichen Information Technology Shanghai Co ltd
Original Assignee
Shichen Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shichen Information Technology Shanghai Co ltd filed Critical Shichen Information Technology Shanghai Co ltd
Priority to CN201910168783.7A priority Critical patent/CN109934847B/en
Publication of CN109934847A publication Critical patent/CN109934847A/en
Application granted granted Critical
Publication of CN109934847B publication Critical patent/CN109934847B/en

Abstract

The invention provides a method and a device for estimating the posture of a weakly textured three-dimensional object. The method comprises the following steps: importing a three-dimensional object model, rendering the three-dimensional object model into planar images according to a target background image, the model and the corresponding postures, and performing deep neural network training; then performing object detection, recognition and posture estimation on an input planar image using the trained detection model file, and optimizing the estimated object posture. The invention uses a deep neural network model to estimate the object pose in real time, thereby improving the accuracy and robustness of detection, recognition and posture estimation for weakly textured three-dimensional objects.

Description

Method and device for estimating posture of weak texture three-dimensional object
Technical Field
The embodiment of the invention relates to the field of computer vision, in particular to a method and a device for estimating the posture of a weak texture three-dimensional object.
Background
Given three-dimensional point positions or a three-dimensional model, three-dimensional object tracking projects an object from a known image onto a planar image after transforming its points by an appropriate six-degree-of-freedom pose (R, t). Three-dimensional objects can be divided into richly textured and weakly textured according to the richness of their surface texture. For a richly textured three-dimensional object, posture estimation and tracking can extract image feature point descriptors and match them against known three-dimensional model feature points. For a weakly textured three-dimensional object, however, the lack of robust feature descriptors makes it difficult to estimate the posture by point-to-point matching.
Some current posture estimation methods employ deep learning. Deep learning is a machine learning approach based on representation learning of data. An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a series of edges, regions of particular shape, and so on. Tasks (e.g., face recognition or facial expression recognition) are learned more easily from examples under certain specific representations. The benefit of deep learning is that unsupervised or semi-supervised feature learning and efficient hierarchical feature extraction algorithms replace manual feature engineering. However, supervised deep learning relies on annotated training data sets, and a large volume of economical and reliable annotated data is a necessary precondition for a successful model.
Chinese patent application 201711183555.4 in prior art 1 discloses a tracking system for a three-dimensional object and a tracking method thereof. The tracking system includes a key frame forming unit, a video frame outer parameter analyzing unit, and a tracking determining unit. The key frame forming unit analyzes the data of a template frame and the data of a video frame to form key frame data, the key frame data including the outer parameters of the key frame. The video frame outer parameter analyzing unit, communicatively connected to the key frame forming unit, acquires the outer parameters of the key frame and calculates the outer parameters of the video frame from them. The tracking determining unit, communicatively connected to the video frame outer parameter analyzing unit, acquires the data of the key frame and the data of the video frame and calculates the corresponding pose of the tracked object in the video frame. This patent describes a tracking system based on key frames and feature points, and cannot cope with tracking weakly textured objects.
Chinese patent application 201280055792.1 of prior art 2 discloses a method and apparatus for tracking a three-dimensional object. The method comprises: building a database using a tracking background to store a set of two-dimensional images of the three-dimensional object, wherein the tracking background contains at least one known pattern; receiving a tracking image; determining whether the tracking image matches at least one image in the database according to the feature points of the tracking image; and providing information about the tracking image in response to the match. Constructing the database further comprises: capturing the set of two-dimensional images of the three-dimensional object with the tracking background; extracting a set of feature points from each two-dimensional image; and storing the feature points in the database. This patent proposes an earlier, traditional computer vision method that relies on feature point tracking and cannot cope with tracking weakly textured objects.
However, in the process of implementing the invention, the inventor finds that the prior art has at least the following problems:
the six-degree-of-freedom posture estimation of objects has long been studied in academia and industry, but traditional computer vision methods struggle to estimate a good initial posture of an object; this only changed with the appearance of convolutional neural networks in recent years. Even so, the operating principles of convolutional neural networks cannot be fully explained by academia or industry; there exist scenes that some deep neural networks cannot recognize well, such scenes cannot be predicted before network training completes, and high-precision object posture data cannot be obtained by relying on a deep neural network alone.
Unlike traditional classification and planar localization networks, manual annotation of three-dimensional objects is costly and difficult; methods using traditional manual annotation have low precision and easily introduce additional uncertainty. If a more accurate data set is desired, the required amount of annotation grows geometrically. Existing deep learning methods are therefore difficult to apply at scale for lack of annotation data.
It should be noted that the above background description is only for the sake of clarity and complete description of the technical solutions of the present invention and for the understanding of those skilled in the art. Such solutions are not considered to be known to the person skilled in the art merely because they have been set forth in the background section of the invention.
Disclosure of Invention
In view of the above problems, an object of the embodiments of the present invention is to provide a method and an apparatus for estimating a posture of a weak texture three-dimensional object, in which a general annotation training method is provided, and a deep neural network model is used to estimate a posture of an object in real time, so as to improve accuracy and robustness of posture estimation of the weak texture three-dimensional object.
In order to achieve the above object, an embodiment of the present invention provides a method for estimating the posture of a weakly textured three-dimensional object, comprising: importing a three-dimensional object model, rendering the three-dimensional object model into a planar image according to a target background image, the three-dimensional object model and the corresponding posture, and performing deep neural network training.
The importing of the three-dimensional object model and rendering it into a planar image according to the model and the corresponding posture specifically includes: inputting a three-dimensional object model file, postures and target background image information; rendering different postures of the three-dimensional object model into planar images according to these inputs, forming combined data of different posture viewing angles and different target backgrounds; and, when rendering different postures of the model to a planar image, acquiring the ground-truth data required for neural network training, the ground-truth data comprising fully automatic data obtained from the rendering, and semi-automatic semi-manual annotations obtained by tracking.
The deep neural network training specifically comprises: dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, each bounding box corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is; the object id and posture orientation are determined through the confidence probabilities. If the center of an object falls within a bounding box, that grid cell is responsible for detecting the object; the object and the bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data.
Performing object detection, recognition and posture estimation on an input planar image with the trained detection model file, and optimizing the estimated object posture, specifically includes: performing object detection and recognition on the planar image with the trained detection model file, and estimating the initial posture of the recognized object; and optimizing the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
The tracked semi-automatic semi-manual annotation is specifically: tracking the three-dimensional object model in real time and recording the tracked images and object postures in real time to produce semi-automatic semi-manual annotations; or first aligning the initial posture of the three-dimensional object model, then tracking the model in real time through pose optimization, and outputting the six-degree-of-freedom posture of the object during tracking for semi-automatic semi-manual annotation. Annotation refers to the process of attaching labels to an image or data.
When deep neural network training is performed, fully automatic rendered data are first used as input samples to pre-train the deep neural network model. In the initial stage the detection rate of the model improves markedly as training samples are added; the improvement then slows as the number of samples grows. Once adding input samples raises the actual detection rate by less than a set threshold, the deep neural network model is further trained with annotation data from fully manual annotation and/or semi-automatic semi-manual annotation.
During the training process, the loss of the object detection model is further calculated, wherein the loss function of the object model is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
where λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement.
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i))
where Prob(obj_i) represents the confidence probability of object i.
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
where Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i.
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
where 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i and y_i represent the x and y coordinates of the center point of the prediction frame of object i, and x_i* and y_i* represent their true values.
The embodiment of the invention also provides a device for estimating the posture of a weakly textured three-dimensional object, comprising:
a three-dimensional object model rendering module, configured to render the three-dimensional object model into a planar image according to the target background image, the three-dimensional object model and the corresponding posture; specifically configured to render different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the planar image background information, forming combined data of different posture viewing angles and different backgrounds; and to acquire, when rendering different postures of the model to a planar image, the ground-truth data required for neural network training, the ground-truth data comprising fully automatic data obtained from the rendering, and semi-automatic semi-manual annotations obtained by tracking;
a model training module, configured to perform deep neural network training; the model training module is further configured to divide the planar image into a plurality of grid cells and to predict bounding boxes in each cell, each bounding box corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, that cell is responsible for detecting the object, the object and the bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data; during training, the loss of the object detection model is calculated;
a posture estimation module, configured to perform object detection, recognition and posture estimation on the input planar image with the trained detection model file; and further to detect and recognize objects in the planar image with the trained detection model file and estimate the initial posture of each recognized object;
a pose optimization module, configured to optimize the estimated object posture; and further configured to optimize the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
The embodiment of the present invention further provides a device for estimating the posture of a weakly textured three-dimensional object, including a memory and a processor, wherein: the memory is used for storing code and documents; the processor is used for executing the code stored in the memory to implement the method for weak texture three-dimensional object posture estimation as described above.
Compared with traditional computer vision methods, which struggle to perceive the position of an object in a planar image, the method and device for estimating the posture of a weakly textured three-dimensional object provided by the embodiments of the invention fuse a deep neural network with traditional computer vision: the initial posture of a known object can be estimated by the deep neural network without acquiring a depth image, and this initial posture is then further optimized, so that results of higher accuracy and robustness are obtained.
In addition, for acquiring annotation data, the embodiments of the invention can adopt a fully automatic computer rendering mode, which greatly reduces the labor cost of annotation while guaranteeing data precision, and yields high-precision six-degree-of-freedom posture data. Alternatively, manually collected data combined with semi-automatic annotation can be used, which both acquires real data and overcomes the mismatch between rendered data and real scenes, thereby improving the model's perception and detection of the real physical world.
In this way, a deep neural network model is used to estimate the object pose in real time, the labor cost of annotation is reduced, and the accuracy and robustness of weak texture three-dimensional object posture estimation are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for estimating a posture of a three-dimensional object with weak texture according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of deep neural network training according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of estimating a planar image to be detected according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of another apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of the present invention.
The embodiment of the invention provides a method for estimating the posture of a weakly textured three-dimensional object, which mainly comprises deep neural network training, three-dimensional object recognition, three-dimensional object segmentation and posture estimation, and three-dimensional object posture adjustment and optimization. As shown in fig. 1, the method specifically comprises the following steps:
step 1, importing a three-dimensional object model, rendering the three-dimensional object model into a plane image according to a target background image, the three-dimensional object model and a corresponding posture, and performing deep neural network training.
The detailed mode of the step can be seen in fig. 2.
Step 101, inputting a three-dimensional object model file, a posture and target background image information.
In this embodiment, if the three-dimensional object model is known, the three-dimensional object model file can be imported directly. If no three-dimensional object model file exists, one can be generated by external scanning or similar means. The target background image information is used to project and superimpose the subsequently rendered object on a planar background to simulate a real environment. Because real objects are usually observed from a few common viewing angles, the object postures may first be analyzed statistically before generation, and the postures then generated according to the resulting statistical distribution, e.g. six-degree-of-freedom postures (R, t) drawn from the fitted probability distribution.
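As a purely illustrative sketch (not part of the original disclosure), pose sampling biased toward common viewing angles might look like the Python fragment below; the distribution family and all numeric parameters are hypothetical:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_poses(n, elev_choices=(10.0, 30.0, 60.0), elev_probs=(0.5, 0.3, 0.2),
                 dist_range=(0.4, 1.2)):
    """Draw n six-degree-of-freedom poses (R, t), biased toward the
    viewing angles that statistics of real observations favor.
    All parameter values here are placeholders, not from the patent."""
    poses = []
    for _ in range(n):
        yaw = np.random.uniform(-180.0, 180.0)               # objects seen from any side
        elev = np.random.choice(elev_choices, p=elev_probs)  # but from few elevations
        roll = np.random.normal(0.0, 5.0)                    # camera roughly upright
        R = Rotation.from_euler('zyx', [yaw, elev, roll], degrees=True).as_matrix()
        t = np.array([0.0, 0.0, np.random.uniform(*dist_range)])  # object in front of camera
        poses.append((R, t))
    return poses
```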
Step 102, rendering different postures of the three-dimensional object model into the target background image to form combined data of different posture viewing angles and different target backgrounds.
In this embodiment, the three-dimensional object rendering module renders different postures of the three-dimensional object model into the target background image according to the input three-dimensional object model file, the postures and the target background image information. Rendering and projecting a three-dimensional object model is a standard capability in the industry; the innovation of this embodiment lies in rendering the model in different postures onto random background images, forming a series of combined data with different posture viewing angles and different backgrounds, so as to cover as far as possible the situations that occur in daily life.
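For illustration, compositing a rendered view onto a random background can be sketched as below (a minimal alpha blend, assuming the renderer outputs an RGBA image whose alpha channel is the object mask; not from the original disclosure):

```python
import numpy as np

def composite(render_rgba, background_rgb):
    """Overlay a rendered object (H x W x 4, alpha = object mask) onto a
    background image of the same size to synthesize one training frame."""
    rgb = render_rgba[..., :3].astype(np.float32)
    alpha = render_rgba[..., 3:4].astype(np.float32) / 255.0
    out = alpha * rgb + (1.0 - alpha) * background_rgb.astype(np.float32)
    return out.astype(np.uint8)
```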
Step 103, acquiring the ground-truth data required for neural network training when rendering different postures of the three-dimensional object model to a planar image.
In this embodiment, when different postures of the three-dimensional object model are rendered to a planar image, the ground-truth data required for neural network training are acquired, the ground-truth data comprising: fully manual annotation, semi-automatic semi-manual annotation by tracking, and fully automatic data acquired by rendering. Annotation refers to marking the current image or data with a corresponding label.
Because annotating a single image is relatively difficult, this embodiment adopts continuous annotation by tracking the three-dimensional object model, which saves annotation cost. Specifically, using a three-dimensional object model tracking module, for example one based on an Edge Distance Field (EDF), the posture of the object can be tracked and output in real time, and semi-automatic annotation is achieved by recording the tracked images and object postures in real time.
Alternatively, the initial posture of the three-dimensional object model is first aligned manually, i.e. a virtual three-dimensional object is projected onto the planar image and the current camera position is adjusted manually until the actual object in the image coincides with the virtual object, giving the initial posture; the pose optimization module is then invoked to track the three-dimensional object model in real time, and the six-degree-of-freedom posture of the object is output offline or online during tracking for semi-automatic semi-manual annotation.
Step 104, training the deep neural network model according to the acquired ground-truth data.
In this embodiment, computer-rendered images may be used to pre-train the deep neural network model, with the fully automatic rendered data as the network's input samples. The required precision depends on the actual situation; the usual criterion is: if adding input samples no longer improves the actual detection rate appreciably (e.g. a detection-rate-gain threshold is set, and a gain below the threshold counts as no appreciable improvement), pre-training on computer-rendered images is stopped, and the deep neural network model is further trained on annotation data from fully manual annotation and/or semi-automatic semi-manual annotation instead.
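The stopping criterion can be sketched as follows (illustrative only; `model.fit` and `evaluate` stand for a hypothetical training interface and a held-out detection-rate evaluation, and the threshold value is arbitrary):

```python
def pretrain_on_rendered_data(model, rendered_batches, evaluate, gain_threshold=0.005):
    """Pre-train on fully automatic rendered samples; stop once adding more
    samples raises the detection rate by less than gain_threshold, after
    which training switches to (semi-)manually annotated data."""
    prev_rate = 0.0
    for batch in rendered_batches:
        model.fit(batch)               # hypothetical training step
        rate = evaluate(model)         # detection rate on a held-out test set
        if rate - prev_rate < gain_threshold:
            break                      # improvement no longer appreciable
        prev_rate = rate
    return model
```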
In actual training, using the whole image directly makes the model hard to converge. Therefore, in this embodiment the two-dimensional image is preferably divided into several grid cells, and the center-point coordinates and corresponding confidence probability are regressed in each cell for model training. The annotation data input to the model training module may include the planar image, the corresponding three-dimensional object ID, the center coordinates (x, y) and the object frame size (Wx, Wy). The three-dimensional object ID is numbered automatically in model-file order and is used to check the returned value for the target object. The center coordinates (x, y) give the center of the target detection frame, and (Wx, Wy) gives the size of the detection frame.
Specifically, the input image is divided into an S × S grid. If the center of an object falls within a cell, that cell is responsible for detecting the object. Each cell predicts B bounding boxes (B may be 2), each corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is; the object id and posture orientation are determined through the confidence probabilities.
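To make the grid encoding concrete, the following sketch (not from the original disclosure; S, B and the image size are example values) builds the S × S training target from the annotation tuples (object ID, center coordinates, frame size):

```python
import numpy as np

def encode_target(labels, S=7, B=2, img_w=448, img_h=448):
    """For each labeled object (obj_id, cx, cy, Wx, Wy), the grid cell
    containing the box center stores x, y, w, h and confidence 1; all
    other cells keep confidence 0."""
    target = np.zeros((S, S, B * 5), dtype=np.float32)
    for obj_id, cx, cy, wx, wy in labels:
        col = min(int(cx / img_w * S), S - 1)
        row = min(int(cy / img_h * S), S - 1)
        x = cx / img_w * S - col              # center offset within the cell
        y = cy / img_h * S - row
        w, h = wx / img_w, wy / img_h         # size relative to the whole image
        target[row, col, :5] = (x, y, w, h, 1.0)  # fill the first box slot
    return target
```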
The confidence probability is computed as Pr(Object) · IOU(pred, truth), jointly reflecting the probability that an object is present and the accuracy of the predicted location: Pr(Object) reflects the likelihood, under the current model, that an object is present in the cell's bounding box, and IOU(pred, truth) reflects the positional accuracy of the predicted object within the bounding box. If there is no object in the bounding box, the confidence probability should be 0; if there is, the confidence probability is the IoU (Intersection over Union) between the predicted bounding box and the ground-truth bounding box. If the center of an object falls within a bounding box, the object and the five bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data.
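A standard IoU computation, as used in the confidence probability above, can be written as follows (illustrative sketch; boxes are assumed to be given as corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)  # guard against division by zero
```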
In the training process, the loss of the object detection model is also considered, and the loss function of the object model is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
where λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement.
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i))
where Prob(obj_i) represents the confidence probability of object i.
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
where Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i.
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
where 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i and y_i represent the x and y coordinates of the center point of the prediction frame of object i, and x_i* and y_i* represent their true values.
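The combined loss can be sketched numerically as below (illustrative only: the λ values are arbitrary, the squared-error forms of Loss_width and Loss_shift follow the reconstruction above, and Loss_obj is implemented exactly as the text states it):

```python
import numpy as np

def detection_loss(pred, truth, lam_obj=1.0, lam_width=5.0, lam_shift=5.0):
    """Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift.
    `pred` and `truth` are dicts of per-object arrays; truth['mask'] is
    the indicator 1_i (1 if the classifier returned object i's id)."""
    prob = np.clip(pred['prob'], 1e-9, 1.0)
    loss_obj = np.sum(prob * np.log(prob))                       # classification term as given
    loss_width = np.sum((pred['w'] - truth['w'])**2
                        + (pred['h'] - truth['h'])**2)           # bounding-box term
    mask = truth['mask']
    loss_shift = np.sum(mask * ((pred['x'] - truth['x'])**2
                                + (pred['y'] - truth['y'])**2))  # center-offset term
    return lam_obj * loss_obj + lam_width * loss_width + lam_shift * loss_shift
```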
Step 105, evaluating on the test set; if the evaluation fails, return to step 104; if it passes, perform step 106, i.e. output the detection model file.
In this step, the test set is similar in form to the training set; only its purpose differs. The training set is used to train the network model, while the test set is used to evaluate neural network performance. A test threshold is preset, typically based on the IoU, a criterion that measures how accurately the corresponding objects are detected on a particular data set.
If the threshold is not reached, the model fails, and the deep neural network model is trained further after annotation data are added.
If the threshold is reached, the evaluation passes and the detection model file is output; the detection model file may consist of the weights of each node of the deep neural network.
Step 2, performing object detection, recognition and posture estimation on the input planar image using the trained detection model file, and optimizing the estimated object posture.
The detailed mode of the step can be seen in fig. 3.
Step 201, inputting a planar image to be detected.
In this embodiment, the input data may be a planar image to be detected after distortion removal. Real cameras generally introduce some distortion during capture and imaging; distortion removal eliminates this distortion from the image by means of camera calibration.
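As an illustrative sketch (not part of the original disclosure), distortion removal with OpenCV could look like this; the intrinsic matrix K and the distortion coefficients are placeholder values from a hypothetical prior calibration:

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients; in practice they
# come from camera calibration (e.g. cv2.calibrateCamera).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.12, 0.05, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

frame = cv2.imread('frame.png')
undistorted = cv2.undistort(frame, K, dist)    # planar image to be detected
```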
Step 202, performing object detection, recognition and posture estimation on the planar image using the trained detection model file.
In this embodiment, object detection and recognition are performed on the planar image with the trained detection model file, and the initial posture of each recognized object is estimated: the detection model takes the planar image as input and, after processing by the internal neural network, outputs the estimated three-dimensional object ID, center coordinates (x, y) and object frame size (Wx, Wy). The object posture, which has six degrees of freedom (three of translation xyz and three of angle), can then be estimated from the center-point coordinates and the frame size: x and y are solved from the center coordinates, the object depth z is solved from the frame size, and the angles are solved from the object orientation output by the neural network.
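The translation part of this initial posture can be sketched under a pinhole camera model as follows (illustrative; assumes the physical object width `model_width` is known from the three-dimensional model, and fx, fy, ppx, ppy are the calibrated intrinsics):

```python
def initial_translation(cx, cy, box_w, fx, fy, ppx, ppy, model_width):
    """Depth z from the apparent frame width (similar triangles:
    box_w = fx * model_width / z), then x, y by back-projecting the
    frame center; the angular degrees of freedom come from the
    orientation output by the neural network."""
    z = fx * model_width / box_w
    x = (cx - ppx) * z / fx
    y = (cy - ppy) * z / fy
    return x, y, z
```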
Step 203, optimizing the pose of the object and determining the posture of the three-dimensional object.
In this embodiment, the neural network provides only a single estimate of the initial posture of the object, and an error remains between the actual posture and this initial posture, so the estimated object posture needs further optimization. Preferably, the object posture can be further optimized using methods such as the edge distance field, with the reprojection error as the error energy function:
E = |π(o_i, g_cam,obj) − x_i'|²
where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
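A minimal least-squares sketch of minimizing this energy over the six-degree-of-freedom pose is given below (illustrative only; `nearest_image_edge` is a hypothetical helper standing in for the edge-distance-field correspondence search, and the pose is parameterized as a rotation vector plus translation):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, K, model_edge_pts, nearest_image_edge):
    """Per-point residual of E = |π(o_i, g_cam,obj) − x_i'|²: project the
    model edge samples and compare with the nearest image edge points."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam_pts = model_edge_pts @ R.T + t        # object frame -> camera frame
    proj = cam_pts @ K.T
    proj = proj[:, :2] / proj[:, 2:3]         # perspective division
    return (proj - nearest_image_edge(proj)).ravel()

def refine_pose(pose0, K, model_edge_pts, nearest_image_edge):
    """Refine the network's initial pose (rotvec + t, a 6-vector)."""
    result = least_squares(residuals, pose0,
                           args=(K, model_edge_pts, nearest_image_edge))
    return result.x
```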
As shown in fig. 4, an embodiment of the present invention further provides an apparatus for estimating a pose of a three-dimensional object with weak texture, including:
a three-dimensional object model rendering module 401, configured to render a three-dimensional object model into a planar image according to a target background image, the three-dimensional object model, and a corresponding posture;
a model training module 402 for performing deep neural network training;
an attitude estimation module 403, configured to perform object detection and identification and attitude estimation on the input planar image by using the trained detection model file;
and a pose optimization module 404, configured to optimize the estimated object pose.
Wherein:
the three-dimensional object model rendering module 401 is specifically configured to render different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the target background image information, forming combined data of different posture viewing angles and different target backgrounds, and to acquire, when rendering different postures of the model to a planar image, the ground-truth data required for neural network training, the ground-truth data comprising: fully manual annotation, semi-automatic semi-manual annotation by tracking, and fully automatic data obtained by rendering;
the model training module 402 is specifically configured to pre-train a deep neural network model by using full-automatic data obtained by rendering as an input sample, wherein in an initial stage, as training samples increase, the detection rate of the model is improved significantly, and then as the number of samples increases, the detection rate is improved and gradually slowed down, and when the actual detection rate increases and is lower than a set threshold value as the input samples increase, the deep neural network model is further trained by using labeled data of full-manual labeling and/or semi-automatic semi-manual labeling; in addition, the model training module 402 is further configured to divide the planar image into a plurality of grids; and predicting a bounding box in each grid, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and corresponding object confidence probabilities, wherein x and y represent coordinates of a central point of the grid, w and h are widths and heights predicted relative to the whole image, the confidence probability is the degree of accuracy of the position of whether an object is contained or not, and the id and the attitude orientation of the object are determined through the confidence probabilities; if the center of the object falls in the boundary box, the grid is responsible for detecting the object, labeling the object and the prediction parameters of the boundary box, and performing object detection recognition and attitude estimation training according to the labeled data; in the training process, the loss of the object detection model is calculated, and the specific formula of the loss function is the same as that of the method for estimating the weak texture three-dimensional object posture, so detailed description is omitted.
The attitude estimation module 403 is specifically configured to perform object detection and identification on the planar image by using the trained detection model file, and estimate an initial attitude of the identified object;
the pose optimization module 404 is specifically configured to optimize the estimated object pose through an error energy function, where a specific formula of the error energy function is the same as that of the weak texture three-dimensional object pose estimation method, and thus details are not repeated.
The specific technical details of the device for estimating the posture of the weak texture three-dimensional object are similar to the method for estimating the posture of the weak texture three-dimensional object, and therefore detailed description is omitted.
As shown in fig. 5, an apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention includes a memory and a processor, where:
a memory 501, for storing code and documents;
a processor 502, for executing the code stored in the memory 501 to implement the method for weak texture three-dimensional object posture estimation as described previously.
The specific technical details of the device for estimating the posture of the weak texture three-dimensional object are similar to the method for estimating the posture of the weak texture three-dimensional object, and therefore detailed description is omitted.
Therefore, the deep neural network and the traditional computer vision method are fused: the initial posture of a known object can be estimated by the deep neural network without acquiring a depth image, and this initial posture is then further adjusted and optimized, so that results of higher accuracy and robustness are obtained. In addition, embodiments of the invention can adopt a fully automatic computer rendering mode, which greatly reduces the labor cost of annotation while guaranteeing data precision, and yields high-precision six-degree-of-freedom posture data. Alternatively, manually collected data combined with semi-automatic annotation can be used, which both acquires real data and overcomes the mismatch between rendered data and real scenes, thereby improving the model's perception and detection of the real physical world.
Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments.
Finally, it should be noted that the foregoing description of various embodiments of the invention is provided to those skilled in the art for the purpose of illustration. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. Various alternatives and modifications of the invention, as described above, will be apparent to those skilled in the art. Thus, while some alternative embodiments have been discussed in detail, other embodiments will be apparent to, or relatively easily derived by, those of ordinary skill in the art. The present invention is intended to embrace all alternatives, modifications and variations discussed herein, as well as other embodiments falling within the spirit and scope of the above application.

Claims (8)

1. A method for estimating the posture of a weak texture three-dimensional object is characterized by comprising the following steps:
importing a three-dimensional object model, rendering the three-dimensional object model into a plane image according to a target background image, the three-dimensional object model and a corresponding posture, and performing deep neural network training;
the importing the three-dimensional object model, rendering the three-dimensional object model into a planar image according to the three-dimensional object model and the corresponding posture, specifically includes:
inputting a three-dimensional object model file, a posture and target background image information;
rendering different postures of the three-dimensional object model into a planar image according to the input three-dimensional object model file, the postures and the target background image information to form combined data of different posture visual angles and different target backgrounds;
when different postures of the three-dimensional object model are rendered to a planar image, acquiring ground-truth data required for neural network training, wherein the ground-truth data comprise: fully automatic data obtained by the rendering, and semi-automatic semi-manual annotations obtained by tracking;
the deep neural network training specifically comprises: dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, wherein x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, the cell is responsible for detecting the object, the object and the prediction parameters of the bounding box are annotated, and object detection, recognition and posture estimation training is performed according to the annotated data;
the method comprises the following steps of carrying out object detection recognition and posture estimation on an input plane image by utilizing a trained detection model file, and optimizing the estimated object posture, and specifically comprises the following steps: carrying out object detection and identification on the plane image by using the trained detection model file, and estimating the initial posture of the identified object; the estimated object attitude is adjusted and optimized through an error energy functionThe error energy function is: e ═ pi (o)i,gcam,obj)-xi,|2Wherein, pi (o)i,gcam,obj) As a projection function of the edge sampling points of the object, oiAs edge sampling points, gcam,objFor projective transformation, x, of an object coordinate system to a camera plane coordinate systemi’Representing the nearest edge point coordinate from the point projection point i' on the planar image.
2. The method for estimating the posture of a weakly-textured three-dimensional object according to claim 1, wherein the tracked semi-automatic semi-manual labeling is specifically as follows:
tracking the three-dimensional object model in real time, and performing semi-automatic semi-manual labeling by recording the tracked image and the object posture in real time; alternatively, the first and second electrodes may be,
firstly, aligning the initial posture of a three-dimensional object model, then tracking the three-dimensional object model in real time through pose optimization, outputting the six-degree-of-freedom posture of the object in the tracking process, and carrying out semi-automatic semi-manual labeling;
wherein labeling refers to the process of labeling an image or data.
3. The method for estimating the pose of a weakly-textured three-dimensional object according to claim 1, wherein the deep neural network training specifically comprises:
the method comprises the steps that full-automatic data obtained by rendering are used as input samples to pre-train a deep neural network model, in the initial stage, along with the increase of training samples, the detection rate of the model is improved obviously, then along with the increase of the number of samples, the detection rate is improved and gradually slowed down, and when the number of the input samples is increased and the actual detection rate is improved and is lower than a set threshold value, the deep neural network model is further trained by adopting marking data of full-manual marking and/or semi-automatic semi-manual marking.
4. The method of weak texture three-dimensional object pose estimation according to claim 3, further comprising:
in the training process, calculating the loss of the object detection model, wherein the loss function is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
wherein λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement;
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i)), wherein Prob(obj_i) represents the confidence probability of object i;
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
wherein Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i;
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
wherein 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i represents the x coordinate of the center point of the prediction frame of object i, y_i represents the y coordinate of the center point of the prediction frame of object i, and x_i* and y_i* represent the true values of the x and y coordinates.
5. An apparatus for estimating the pose of a weakly textured three-dimensional object, comprising:
a three-dimensional object model rendering module, used for rendering the three-dimensional object model into a planar image according to the target background image, the three-dimensional object model and the corresponding posture; specifically used for rendering different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the planar image background information, forming combined data of different posture viewing angles and different backgrounds; and for acquiring, when rendering different postures of the three-dimensional object model to a planar image, ground-truth data required for neural network training, wherein the ground-truth data comprise: fully automatic data obtained by the rendering, and semi-automatic semi-manual annotations obtained by tracking;
a model training module, used for performing deep neural network training; the model training module is further used for dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, wherein x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, the cell is responsible for detecting the object, the object and the prediction parameters of the bounding box are annotated, and object detection, recognition and posture estimation training is performed according to the annotated data; and during the training process, calculating the loss of the object detection model;
a posture estimation module, used for performing object detection, recognition and posture estimation on the input planar image with the trained detection model file; and further used for performing object detection and recognition on the planar image with the trained detection model file and estimating the initial posture of the recognized object;
a pose optimization module, used for optimizing the estimated object posture; and further used for optimizing the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², wherein π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
6. The apparatus for estimating the posture of a weakly textured three-dimensional object according to claim 5, wherein the model training module is specifically configured to pre-train the deep neural network model with fully automatic rendered data as input samples, wherein in the initial stage the detection rate of the model improves markedly as training samples increase, and the improvement then slows as the number of samples grows; when the gain in the actual detection rate from adding input samples falls below a set threshold, the deep neural network model is further trained with annotation data from fully manual annotation and/or semi-automatic semi-manual annotation.
7. The apparatus for estimating pose of weak texture three-dimensional object according to claim 6, wherein the model training module further calculates loss of the object detection model during training, the loss function is:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
wherein λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement;
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i)), wherein Prob(obj_i) represents the confidence probability of object i;
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
wherein Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i;
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
wherein 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i represents the x coordinate of the center point of the prediction frame of object i, y_i represents the y coordinate of the center point of the prediction frame of object i, and x_i* and y_i* represent the true values of the x and y coordinates.
8. An apparatus for weak texture three-dimensional object pose estimation, the apparatus comprising a memory and a processor, wherein:
the memory is used for storing codes and documents;
the processor for executing the code and documents stored in the memory for implementing the method steps of any of claims 1 to 4.
CN201910168783.7A 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object Active CN109934847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910168783.7A CN109934847B (en) 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object

Publications (2)

Publication Number Publication Date
CN109934847A CN109934847A (en) 2019-06-25
CN109934847B true CN109934847B (en) 2020-05-22

Family

ID=66986438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910168783.7A Active CN109934847B (en) 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object

Country Status (1)

Country Link
CN (1) CN109934847B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276804B (en) * 2019-06-29 2024-01-02 深圳市商汤科技有限公司 Data processing method and device
CN110728222B (en) * 2019-09-30 2022-03-25 清华大学深圳国际研究生院 Pose estimation method for target object in mechanical arm grabbing system
CN110928414A (en) * 2019-11-22 2020-03-27 上海交通大学 Three-dimensional virtual-real fusion experimental system
CN111510701A (en) * 2020-04-22 2020-08-07 Oppo广东移动通信有限公司 Virtual content display method and device, electronic equipment and computer readable medium
CN113643433A (en) * 2020-04-27 2021-11-12 成都术通科技有限公司 Form and attitude estimation method, device, equipment and storage medium
CN111652901B (en) * 2020-06-02 2021-03-26 山东大学 Texture-free three-dimensional object tracking method based on confidence coefficient and feature fusion
CN113822102B (en) * 2020-06-19 2024-02-20 北京达佳互联信息技术有限公司 Gesture estimation method and device, electronic equipment and storage medium
CN112381879A (en) * 2020-11-16 2021-02-19 华南理工大学 Object posture estimation method, system and medium based on image and three-dimensional model
CN113094016B (en) * 2021-06-09 2021-09-07 上海影创信息科技有限公司 System, method and medium for information gain and display
CN113780291A (en) * 2021-08-25 2021-12-10 北京达佳互联信息技术有限公司 Image processing method and device, electronic equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003134B1 (en) * 1999-03-08 2006-02-21 Vulcan Patents Llc Three dimensional object pose estimation which employs dense depth information
US7706603B2 (en) * 2005-04-19 2010-04-27 Siemens Corporation Fast object detection for augmented reality systems
US8351646B2 (en) * 2006-12-21 2013-01-08 Honda Motor Co., Ltd. Human pose estimation and tracking using label assignment
CN101499128B (en) * 2008-01-30 2011-06-29 中国科学院自动化研究所 Three-dimensional human face action detecting and tracing method based on video stream
CN102270345A (en) * 2011-06-02 2011-12-07 西安电子科技大学 Image feature representing and human motion tracking method based on second-generation strip wave transform
CN103116902A (en) * 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
US8855366B2 (en) * 2011-11-29 2014-10-07 Qualcomm Incorporated Tracking three-dimensional objects
CN102609680B (en) * 2011-12-22 2013-12-04 中国科学院自动化研究所 Method for detecting human body parts by performing parallel statistical learning based on three-dimensional depth image information
CN104376600B (en) * 2014-11-25 2018-04-17 四川大学 Stabilization threedimensional model tracking based on online management super-resolution block
WO2017156243A1 (en) * 2016-03-11 2017-09-14 Siemens Aktiengesellschaft Deep-learning based feature mining for 2.5d sensing image search
US10373369B2 (en) * 2017-03-16 2019-08-06 Qualcomm Technologies, Inc. Three-dimensional pose estimation of symmetrical objects

Also Published As

Publication number Publication date
CN109934847A (en) 2019-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant