CN109934847B - Method and device for estimating posture of weak texture three-dimensional object


Info

Publication number
CN109934847B
Authority
CN
China
Prior art keywords: model, loss, dimensional object, obj, posture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910168783.7A
Other languages
Chinese (zh)
Other versions
CN109934847A (en)
Inventor
刘万凯 (Liu Wankai)
刘力 (Liu Li)
李中源 (Li Zhongyuan)
张小军 (Zhang Xiaojun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shichen Information Technology Shanghai Co ltd
Original Assignee
Shichen Information Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shichen Information Technology Shanghai Co ltd filed Critical Shichen Information Technology Shanghai Co ltd
Priority to CN201910168783.7A priority Critical patent/CN109934847B/en
Publication of CN109934847A publication Critical patent/CN109934847A/en
Application granted granted Critical
Publication of CN109934847B publication Critical patent/CN109934847B/en

Abstract

The invention provides a method and a device for estimating the posture of a weakly textured three-dimensional object. The method comprises the following steps: importing a three-dimensional object model, rendering the three-dimensional object model into planar images according to a target background image, the model and the corresponding postures, and performing deep neural network training; then performing object detection, recognition and posture estimation on an input planar image using the trained detection model file, and optimizing the estimated object posture. The invention uses a deep neural network model to estimate the object pose in real time, thereby improving the accuracy and robustness of detection, recognition and posture estimation for weakly textured three-dimensional objects.

Description

Method and device for estimating posture of weak texture three-dimensional object
Technical Field
The embodiment of the invention relates to the field of computer vision, in particular to a method and a device for estimating the posture of a weak texture three-dimensional object.
Background
Given three-dimensional point positions or a three-dimensional model, three-dimensional object tracking projects an object from a known image onto a planar image after transforming its points by an appropriate six-degree-of-freedom pose (R, t). Three-dimensional objects can be divided into richly textured and weakly textured according to the richness of their surface texture. For a richly textured three-dimensional object, posture estimation and tracking can extract image feature point descriptors and match them against known three-dimensional model feature points. For a weakly textured three-dimensional object, however, the lack of robust feature descriptors makes it difficult to estimate the posture by point-to-point matching.
Some current posture estimation methods employ deep learning. Deep learning is a machine learning approach based on representation learning of data. An observation (e.g., an image) can be represented in many ways, such as a vector of per-pixel intensity values, or more abstractly as a series of edges, regions of particular shape, and so on. Tasks (e.g., face recognition or facial expression recognition) are learned more easily from examples under certain specific representations. The benefit of deep learning is that unsupervised or semi-supervised feature learning and efficient hierarchical feature extraction algorithms replace manual feature engineering. However, supervised deep learning relies on annotated training data sets, and a large volume of economical and reliable annotated data is a necessary precondition for a successful model.
Chinese patent application 201711183555.4 in prior art 1 discloses a tracking system for a three-dimensional object and a tracking method thereof. The tracking system includes a key frame forming unit, a video frame outer parameter analyzing unit, and a tracking determining unit. The key frame forming unit analyzes the data of a template frame and the data of a video frame to form key frame data, the key frame data including the outer parameters of the key frame. The video frame outer parameter analyzing unit, communicatively connected to the key frame forming unit, acquires the outer parameters of the key frame and calculates the outer parameters of the video frame from them. The tracking determining unit, communicatively connected to the video frame outer parameter analyzing unit, acquires the data of the key frame and the data of the video frame and calculates the corresponding pose of the tracked object in the video frame. This patent describes a tracking system based on key frames and feature points, and cannot cope with tracking weakly textured objects.
Chinese patent application 201280055792.1 of prior art 2 discloses a method and apparatus for tracking a three-dimensional object. The method comprises: building a database using a tracking background to store a set of two-dimensional images of the three-dimensional object, wherein the tracking background contains at least one known pattern; receiving a tracking image; determining whether the tracking image matches at least one image in the database according to the feature points of the tracking image; and providing information about the tracking image in response to the match. Constructing the database further comprises: capturing the set of two-dimensional images of the three-dimensional object with the tracking background; extracting a set of feature points from each two-dimensional image; and storing the feature points in the database. This patent proposes an earlier, traditional computer vision method that relies on feature point tracking and cannot cope with tracking weakly textured objects.
However, in the process of implementing the invention, the inventor finds that the prior art has at least the following problems:
the six-degree-of-freedom posture estimation of objects has long been studied in academia and industry, but traditional computer vision methods struggle to estimate a good initial posture of an object; this only changed with the appearance of convolutional neural networks in recent years. Even so, the operating principles of convolutional neural networks cannot be fully explained by academia or industry; there exist scenes that some deep neural networks cannot recognize well, such scenes cannot be predicted before network training completes, and high-precision object posture data cannot be obtained by relying on a deep neural network alone.
Unlike traditional classification and planar localization networks, manual annotation of three-dimensional objects is costly and difficult; methods using traditional manual annotation have low precision and easily introduce additional uncertainty. If a more accurate data set is desired, the required amount of annotation grows geometrically. Existing deep learning methods are therefore difficult to apply at scale for lack of annotation data.
It should be noted that the above background description is only for the sake of clarity and complete description of the technical solutions of the present invention and for the understanding of those skilled in the art. Such solutions are not considered to be known to the person skilled in the art merely because they have been set forth in the background section of the invention.
Disclosure of Invention
In view of the above problems, an object of the embodiments of the present invention is to provide a method and an apparatus for estimating a posture of a weak texture three-dimensional object, in which a general annotation training method is provided, and a deep neural network model is used to estimate a posture of an object in real time, so as to improve accuracy and robustness of posture estimation of the weak texture three-dimensional object.
In order to achieve the above object, an embodiment of the present invention provides a method for estimating the posture of a weakly textured three-dimensional object, comprising: importing a three-dimensional object model, rendering the three-dimensional object model into a planar image according to a target background image, the three-dimensional object model and the corresponding posture, and performing deep neural network training.
The importing of the three-dimensional object model and rendering it into a planar image according to the model and the corresponding posture specifically includes: inputting a three-dimensional object model file, postures and target background image information; rendering different postures of the three-dimensional object model into planar images according to these inputs, forming combined data of different posture viewing angles and different target backgrounds; and, when rendering different postures of the model to a planar image, acquiring the ground-truth data required for neural network training, the ground-truth data comprising fully automatic data obtained from the rendering, and semi-automatic semi-manual annotations obtained by tracking.
The deep neural network training specifically comprises: dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, each bounding box corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is; the object id and posture orientation are determined through the confidence probabilities. If the center of an object falls within a bounding box, that grid cell is responsible for detecting the object; the object and the bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data.
Performing object detection, recognition and posture estimation on an input planar image with the trained detection model file, and optimizing the estimated object posture, specifically includes: performing object detection and recognition on the planar image with the trained detection model file, and estimating the initial posture of the recognized object; and optimizing the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
The tracked semi-automatic semi-manual annotation is specifically: tracking the three-dimensional object model in real time and recording the tracked images and object postures in real time to produce semi-automatic semi-manual annotations; or first aligning the initial posture of the three-dimensional object model, then tracking the model in real time through pose optimization, and outputting the six-degree-of-freedom posture of the object during tracking for semi-automatic semi-manual annotation. Annotation refers to the process of attaching labels to an image or data.
When deep neural network training is performed, fully automatic rendered data are first used as input samples to pre-train the deep neural network model. In the initial stage the detection rate of the model improves markedly as training samples are added; the improvement then slows as the number of samples grows. Once adding input samples raises the actual detection rate by less than a set threshold, the deep neural network model is further trained with annotation data from fully manual annotation and/or semi-automatic semi-manual annotation.
During the training process, the loss of the object detection model is further calculated, wherein the loss function of the object model is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
where λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement.
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i))
where Prob(obj_i) represents the confidence probability of object i.
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
where Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i.
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
where 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i and y_i represent the x and y coordinates of the center point of the prediction frame of object i, and x_i* and y_i* represent their true values.
The embodiment of the invention also provides a device for estimating the posture of a weakly textured three-dimensional object, comprising:
a three-dimensional object model rendering module, configured to render the three-dimensional object model into a planar image according to the target background image, the three-dimensional object model and the corresponding posture; specifically configured to render different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the planar image background information, forming combined data of different posture viewing angles and different backgrounds; and to acquire, when rendering different postures of the model to a planar image, the ground-truth data required for neural network training, the ground-truth data comprising fully automatic data obtained from the rendering, and semi-automatic semi-manual annotations obtained by tracking;
a model training module, configured to perform deep neural network training; the model training module is further configured to divide the planar image into a plurality of grid cells and to predict bounding boxes in each cell, each bounding box corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, that cell is responsible for detecting the object, the object and the bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data; during training, the loss of the object detection model is calculated;
a posture estimation module, configured to perform object detection, recognition and posture estimation on the input planar image with the trained detection model file; and further to detect and recognize objects in the planar image with the trained detection model file and estimate the initial posture of each recognized object;
a pose optimization module, configured to optimize the estimated object posture; and further configured to optimize the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
The embodiment of the present invention further provides a device for estimating the posture of a weakly textured three-dimensional object, including a memory and a processor, wherein: the memory is used for storing code and documents; the processor is used for executing the code stored in the memory to implement the method for weak texture three-dimensional object posture estimation as described above.
Compared with traditional computer vision methods, which struggle to perceive the position of an object in a planar image, the method and device for estimating the posture of a weakly textured three-dimensional object provided by the embodiments of the invention fuse a deep neural network with traditional computer vision: the initial posture of a known object can be estimated by the deep neural network without acquiring a depth image, and this initial posture is then further optimized, so that results of higher accuracy and robustness are obtained.
In addition, for acquiring annotation data, the embodiments of the invention can adopt a fully automatic computer rendering mode, which greatly reduces the labor cost of annotation while guaranteeing data precision, and yields high-precision six-degree-of-freedom posture data. Alternatively, manually collected data combined with semi-automatic annotation can be used, which both acquires real data and overcomes the mismatch between rendered data and real scenes, thereby improving the model's perception and detection of the real physical world.
In this way, a deep neural network model is used to estimate the object pose in real time, the labor cost of annotation is reduced, and the accuracy and robustness of weak texture three-dimensional object posture estimation are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of a method for estimating a posture of a three-dimensional object with weak texture according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of deep neural network training according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of estimating a planar image to be detected according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of another apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of the present invention.
The embodiment of the invention provides a method for estimating the posture of a weakly textured three-dimensional object, which mainly comprises deep neural network training, three-dimensional object recognition, three-dimensional object segmentation and posture estimation, and three-dimensional object posture adjustment and optimization. As shown in fig. 1, the method specifically comprises the following steps:
step 1, importing a three-dimensional object model, rendering the three-dimensional object model into a plane image according to a target background image, the three-dimensional object model and a corresponding posture, and performing deep neural network training.
The detailed mode of the step can be seen in fig. 2.
Step 101, inputting a three-dimensional object model file, a posture and target background image information.
In this embodiment, if the three-dimensional object model is known, the three-dimensional object model file can be imported directly. If no three-dimensional object model file exists, one can be generated by external scanning or similar means. The target background image information is used to project and superimpose the subsequently rendered object on a planar background to simulate a real environment. Because real objects are usually observed from a few common viewing angles, the object postures may first be analyzed statistically before generation, and the postures then generated according to the resulting statistical distribution, e.g. six-degree-of-freedom postures (R, t) drawn from the fitted probability distribution.
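As a purely illustrative sketch (not part of the original disclosure), pose sampling biased toward common viewing angles might look like the Python fragment below; the distribution family and all numeric parameters are hypothetical:

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_poses(n, elev_choices=(10.0, 30.0, 60.0), elev_probs=(0.5, 0.3, 0.2),
                 dist_range=(0.4, 1.2)):
    """Draw n six-degree-of-freedom poses (R, t), biased toward the
    viewing angles that statistics of real observations favor.
    All parameter values here are placeholders, not from the patent."""
    poses = []
    for _ in range(n):
        yaw = np.random.uniform(-180.0, 180.0)               # objects seen from any side
        elev = np.random.choice(elev_choices, p=elev_probs)  # but from few elevations
        roll = np.random.normal(0.0, 5.0)                    # camera roughly upright
        R = Rotation.from_euler('zyx', [yaw, elev, roll], degrees=True).as_matrix()
        t = np.array([0.0, 0.0, np.random.uniform(*dist_range)])  # object in front of camera
        poses.append((R, t))
    return poses
```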
Step 102, rendering different postures of the three-dimensional object model into the target background image to form combined data of different posture viewing angles and different target backgrounds.
In this embodiment, the three-dimensional object rendering module renders different postures of the three-dimensional object model into the target background image according to the input three-dimensional object model file, the postures and the target background image information. Rendering and projecting a three-dimensional object model is a standard capability in the industry; the innovation of this embodiment lies in rendering the model in different postures onto random background images, forming a series of combined data with different posture viewing angles and different backgrounds, so as to cover as far as possible the situations that occur in daily life.
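For illustration, compositing a rendered view onto a random background can be sketched as below (a minimal alpha blend, assuming the renderer outputs an RGBA image whose alpha channel is the object mask; not from the original disclosure):

```python
import numpy as np

def composite(render_rgba, background_rgb):
    """Overlay a rendered object (H x W x 4, alpha = object mask) onto a
    background image of the same size to synthesize one training frame."""
    rgb = render_rgba[..., :3].astype(np.float32)
    alpha = render_rgba[..., 3:4].astype(np.float32) / 255.0
    out = alpha * rgb + (1.0 - alpha) * background_rgb.astype(np.float32)
    return out.astype(np.uint8)
```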
Step 103, acquiring the ground-truth data required for neural network training when rendering different postures of the three-dimensional object model to a planar image.
In this embodiment, when different postures of the three-dimensional object model are rendered to a planar image, the ground-truth data required for neural network training are acquired, the ground-truth data comprising: fully manual annotation, semi-automatic semi-manual annotation by tracking, and fully automatic data acquired by rendering. Annotation refers to marking the current image or data with a corresponding label.
Because annotating a single image is relatively difficult, this embodiment adopts continuous annotation by tracking the three-dimensional object model, which saves annotation cost. Specifically, using a three-dimensional object model tracking module, for example one based on an Edge Distance Field (EDF), the posture of the object can be tracked and output in real time, and semi-automatic annotation is achieved by recording the tracked images and object postures in real time.
Alternatively, the initial posture of the three-dimensional object model is first aligned manually, i.e. a virtual three-dimensional object is projected onto the planar image and the current camera position is adjusted manually until the actual object in the image coincides with the virtual object, giving the initial posture; the pose optimization module is then invoked to track the three-dimensional object model in real time, and the six-degree-of-freedom posture of the object is output offline or online during tracking for semi-automatic semi-manual annotation.
Step 104, training the deep neural network model according to the acquired ground-truth data.
In this embodiment, computer-rendered images may be used to pre-train the deep neural network model, with the fully automatic rendered data as the network's input samples. The required precision depends on the actual situation; the usual criterion is: if adding input samples no longer improves the actual detection rate appreciably (e.g. a detection-rate-gain threshold is set, and a gain below the threshold counts as no appreciable improvement), pre-training on computer-rendered images is stopped, and the deep neural network model is further trained on annotation data from fully manual annotation and/or semi-automatic semi-manual annotation instead.
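The stopping criterion can be sketched as follows (illustrative only; `model.fit` and `evaluate` stand for a hypothetical training interface and a held-out detection-rate evaluation, and the threshold value is arbitrary):

```python
def pretrain_on_rendered_data(model, rendered_batches, evaluate, gain_threshold=0.005):
    """Pre-train on fully automatic rendered samples; stop once adding more
    samples raises the detection rate by less than gain_threshold, after
    which training switches to (semi-)manually annotated data."""
    prev_rate = 0.0
    for batch in rendered_batches:
        model.fit(batch)               # hypothetical training step
        rate = evaluate(model)         # detection rate on a held-out test set
        if rate - prev_rate < gain_threshold:
            break                      # improvement no longer appreciable
        prev_rate = rate
    return model
```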
In actual training, using the whole image directly makes the model hard to converge. Therefore, in this embodiment the two-dimensional image is preferably divided into several grid cells, and the center-point coordinates and corresponding confidence probability are regressed in each cell for model training. The annotation data input to the model training module may include the planar image, the corresponding three-dimensional object ID, the center coordinates (x, y) and the object frame size (Wx, Wy). The three-dimensional object ID is numbered automatically in model-file order and is used to check the returned value for the target object. The center coordinates (x, y) give the center of the target detection frame, and (Wx, Wy) gives the size of the detection frame.
Specifically, the input image is divided into an S × S grid. If the center of an object falls within a cell, that cell is responsible for detecting the object. Each cell predicts B bounding boxes (B may be 2), each corresponding to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, where x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is; the object id and posture orientation are determined through the confidence probabilities.
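To make the grid encoding concrete, the following sketch (not from the original disclosure; S, B and the image size are example values) builds the S × S training target from the annotation tuples (object ID, center coordinates, frame size):

```python
import numpy as np

def encode_target(labels, S=7, B=2, img_w=448, img_h=448):
    """For each labeled object (obj_id, cx, cy, Wx, Wy), the grid cell
    containing the box center stores x, y, w, h and confidence 1; all
    other cells keep confidence 0."""
    target = np.zeros((S, S, B * 5), dtype=np.float32)
    for obj_id, cx, cy, wx, wy in labels:
        col = min(int(cx / img_w * S), S - 1)
        row = min(int(cy / img_h * S), S - 1)
        x = cx / img_w * S - col              # center offset within the cell
        y = cy / img_h * S - row
        w, h = wx / img_w, wy / img_h         # size relative to the whole image
        target[row, col, :5] = (x, y, w, h, 1.0)  # fill the first box slot
    return target
```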
The confidence probability is computed as Pr(Object) · IOU(pred, truth), jointly reflecting the probability that an object is present and the accuracy of the predicted location: Pr(Object) reflects the likelihood, under the current model, that an object is present in the cell's bounding box, and IOU(pred, truth) reflects the positional accuracy of the predicted object within the bounding box. If there is no object in the bounding box, the confidence probability should be 0; if there is, the confidence probability is the IoU (Intersection over Union) between the predicted bounding box and the ground-truth bounding box. If the center of an object falls within a bounding box, the object and the five bounding-box prediction parameters are annotated, and object detection, recognition and posture estimation are trained according to the annotated data.
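A standard IoU computation, as used in the confidence probability above, can be written as follows (illustrative sketch; boxes are assumed to be given as corner coordinates):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)  # guard against division by zero
```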
In the training process, the loss of the object detection model is also considered, and the loss function of the object model is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
where λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement.
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i))
where Prob(obj_i) represents the confidence probability of object i.
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
where Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i.
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
where 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i and y_i represent the x and y coordinates of the center point of the prediction frame of object i, and x_i* and y_i* represent their true values.
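The combined loss can be sketched numerically as below (illustrative only: the λ values are arbitrary, the squared-error forms of Loss_width and Loss_shift follow the reconstruction above, and Loss_obj is implemented exactly as the text states it):

```python
import numpy as np

def detection_loss(pred, truth, lam_obj=1.0, lam_width=5.0, lam_shift=5.0):
    """Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift.
    `pred` and `truth` are dicts of per-object arrays; truth['mask'] is
    the indicator 1_i (1 if the classifier returned object i's id)."""
    prob = np.clip(pred['prob'], 1e-9, 1.0)
    loss_obj = np.sum(prob * np.log(prob))                       # classification term as given
    loss_width = np.sum((pred['w'] - truth['w'])**2
                        + (pred['h'] - truth['h'])**2)           # bounding-box term
    mask = truth['mask']
    loss_shift = np.sum(mask * ((pred['x'] - truth['x'])**2
                                + (pred['y'] - truth['y'])**2))  # center-offset term
    return lam_obj * loss_obj + lam_width * loss_width + lam_shift * loss_shift
```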
Step 105, evaluating on the test set; if the evaluation fails, return to step 104; if it passes, perform step 106, i.e. output the detection model file.
In this step, the test set is similar in form to the training set; only its purpose differs. The training set is used to train the network model, while the test set is used to evaluate neural network performance. A test threshold is preset, typically based on the IoU, a criterion that measures how accurately the corresponding objects are detected on a particular data set.
If the threshold is not reached, the model fails, and the deep neural network model is trained further after annotation data are added.
If the threshold is reached, the evaluation passes and the detection model file is output; the detection model file may consist of the weights of each node of the deep neural network.
Step 2, performing object detection, recognition and posture estimation on the input planar image using the trained detection model file, and optimizing the estimated object posture.
The detailed mode of the step can be seen in fig. 3.
Step 201, inputting a planar image to be detected.
In this embodiment, the input data may be a planar image to be detected after distortion removal. Real cameras generally introduce some distortion during capture and imaging; distortion removal eliminates this distortion from the image by means of camera calibration.
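As an illustrative sketch (not part of the original disclosure), distortion removal with OpenCV could look like this; the intrinsic matrix K and the distortion coefficients are placeholder values from a hypothetical prior calibration:

```python
import cv2
import numpy as np

# Placeholder intrinsics and distortion coefficients; in practice they
# come from camera calibration (e.g. cv2.calibrateCamera).
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
dist = np.array([-0.12, 0.05, 0.0, 0.0, 0.0])  # k1, k2, p1, p2, k3

frame = cv2.imread('frame.png')
undistorted = cv2.undistort(frame, K, dist)    # planar image to be detected
```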
Step 202, performing object detection, recognition and posture estimation on the planar image using the trained detection model file.
In this embodiment, object detection and recognition are performed on the planar image with the trained detection model file, and the initial posture of each recognized object is estimated: the detection model takes the planar image as input and, after processing by the internal neural network, outputs the estimated three-dimensional object ID, center coordinates (x, y) and object frame size (Wx, Wy). The object posture, which has six degrees of freedom (three of translation xyz and three of angle), can then be estimated from the center-point coordinates and the frame size: x and y are solved from the center coordinates, the object depth z is solved from the frame size, and the angles are solved from the object orientation output by the neural network.
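The translation part of this initial posture can be sketched under a pinhole camera model as follows (illustrative; assumes the physical object width `model_width` is known from the three-dimensional model, and fx, fy, ppx, ppy are the calibrated intrinsics):

```python
def initial_translation(cx, cy, box_w, fx, fy, ppx, ppy, model_width):
    """Depth z from the apparent frame width (similar triangles:
    box_w = fx * model_width / z), then x, y by back-projecting the
    frame center; the angular degrees of freedom come from the
    orientation output by the neural network."""
    z = fx * model_width / box_w
    x = (cx - ppx) * z / fx
    y = (cy - ppy) * z / fy
    return x, y, z
```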
Step 203, optimizing the pose of the object and determining the posture of the three-dimensional object.
In this embodiment, the neural network provides only a single estimate of the initial posture of the object, and an error remains between the actual posture and this initial posture, so the estimated object posture needs further optimization. Preferably, the object posture can be further optimized using methods such as the edge distance field, with the reprojection error as the error energy function:
E = |π(o_i, g_cam,obj) − x_i'|²
where π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
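A minimal least-squares sketch of minimizing this energy over the six-degree-of-freedom pose is given below (illustrative only; `nearest_image_edge` is a hypothetical helper standing in for the edge-distance-field correspondence search, and the pose is parameterized as a rotation vector plus translation):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, K, model_edge_pts, nearest_image_edge):
    """Per-point residual of E = |π(o_i, g_cam,obj) − x_i'|²: project the
    model edge samples and compare with the nearest image edge points."""
    R = Rotation.from_rotvec(params[:3]).as_matrix()
    t = params[3:]
    cam_pts = model_edge_pts @ R.T + t        # object frame -> camera frame
    proj = cam_pts @ K.T
    proj = proj[:, :2] / proj[:, 2:3]         # perspective division
    return (proj - nearest_image_edge(proj)).ravel()

def refine_pose(pose0, K, model_edge_pts, nearest_image_edge):
    """Refine the network's initial pose (rotvec + t, a 6-vector)."""
    result = least_squares(residuals, pose0,
                           args=(K, model_edge_pts, nearest_image_edge))
    return result.x
```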
As shown in fig. 4, an embodiment of the present invention further provides an apparatus for estimating a pose of a three-dimensional object with weak texture, including:
a three-dimensional object model rendering module 401, configured to render a three-dimensional object model into a planar image according to a target background image, the three-dimensional object model, and a corresponding posture;
a model training module 402 for performing deep neural network training;
an attitude estimation module 403, configured to perform object detection and identification and attitude estimation on the input planar image by using the trained detection model file;
and a pose optimization module 404, configured to optimize the estimated object pose.
Wherein:
the three-dimensional object model rendering module 401 is specifically configured to render different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the target background image information, forming combined data of different posture viewing angles and different target backgrounds, and to acquire, when rendering different postures of the model to a planar image, the ground-truth data required for neural network training, the ground-truth data comprising: fully manual annotation, semi-automatic semi-manual annotation by tracking, and fully automatic data obtained by rendering;
the model training module 402 is specifically configured to pre-train a deep neural network model by using full-automatic data obtained by rendering as an input sample, wherein in an initial stage, as training samples increase, the detection rate of the model is improved significantly, and then as the number of samples increases, the detection rate is improved and gradually slowed down, and when the actual detection rate increases and is lower than a set threshold value as the input samples increase, the deep neural network model is further trained by using labeled data of full-manual labeling and/or semi-automatic semi-manual labeling; in addition, the model training module 402 is further configured to divide the planar image into a plurality of grids; and predicting a bounding box in each grid, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and corresponding object confidence probabilities, wherein x and y represent coordinates of a central point of the grid, w and h are widths and heights predicted relative to the whole image, the confidence probability is the degree of accuracy of the position of whether an object is contained or not, and the id and the attitude orientation of the object are determined through the confidence probabilities; if the center of the object falls in the boundary box, the grid is responsible for detecting the object, labeling the object and the prediction parameters of the boundary box, and performing object detection recognition and attitude estimation training according to the labeled data; in the training process, the loss of the object detection model is calculated, and the specific formula of the loss function is the same as that of the method for estimating the weak texture three-dimensional object posture, so detailed description is omitted.
The attitude estimation module 403 is specifically configured to perform object detection and identification on the planar image by using the trained detection model file, and estimate an initial attitude of the identified object;
the pose optimization module 404 is specifically configured to optimize the estimated object pose through an error energy function, where a specific formula of the error energy function is the same as that of the weak texture three-dimensional object pose estimation method, and thus details are not repeated.
The specific technical details of the device for estimating the posture of the weak texture three-dimensional object are similar to the method for estimating the posture of the weak texture three-dimensional object, and therefore detailed description is omitted.
As shown in fig. 5, an apparatus for estimating a pose of a three-dimensional object with weak texture according to an embodiment of the present invention includes a memory and a processor, where:
a memory 501, for storing code and documents;
a processor 502, for executing the code stored in the memory 501 to implement the method for weak texture three-dimensional object posture estimation as described previously.
The specific technical details of the device for estimating the posture of the weak texture three-dimensional object are similar to the method for estimating the posture of the weak texture three-dimensional object, and therefore detailed description is omitted.
Therefore, the deep neural network and the traditional computer vision method are fused: the initial posture of a known object can be estimated by the deep neural network without acquiring a depth image, and this initial posture is then further adjusted and optimized, so that results of higher accuracy and robustness are obtained. In addition, embodiments of the invention can adopt a fully automatic computer rendering mode, which greatly reduces the labor cost of annotation while guaranteeing data precision, and yields high-precision six-degree-of-freedom posture data. Alternatively, manually collected data combined with semi-automatic annotation can be used, which both acquires real data and overcomes the mismatch between rendered data and real scenes, thereby improving the model's perception and detection of the real physical world.
Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, etc.) or a processor to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on the differences from the other embodiments.
Finally, it should be noted that the foregoing description of various embodiments of the invention is provided to those skilled in the art for the purpose of illustration. It is not intended to be exhaustive or to limit the invention to a single disclosed embodiment. Various alternatives and modifications of the invention, as described above, will be apparent to those skilled in the art. Thus, while some alternative embodiments have been discussed in detail, other embodiments will be apparent to, or relatively easily derived by, those of ordinary skill in the art. The present invention is intended to embrace all alternatives, modifications and variations discussed herein, as well as other embodiments falling within the spirit and scope of the above application.

Claims (8)

1. A method for estimating the posture of a weak texture three-dimensional object is characterized by comprising the following steps:
importing a three-dimensional object model, rendering the three-dimensional object model into a plane image according to a target background image, the three-dimensional object model and a corresponding posture, and performing deep neural network training;
the importing the three-dimensional object model, rendering the three-dimensional object model into a planar image according to the three-dimensional object model and the corresponding posture, specifically includes:
inputting a three-dimensional object model file, a posture and target background image information;
rendering different postures of the three-dimensional object model into a planar image according to the input three-dimensional object model file, the postures and the target background image information to form combined data of different posture visual angles and different target backgrounds;
when different postures of the three-dimensional object model are rendered to a planar image, acquiring ground-truth data required for neural network training, wherein the ground-truth data comprise: fully automatic data obtained by the rendering, and semi-automatic semi-manual annotations obtained by tracking;
the deep neural network training specifically comprises: dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, wherein x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, the cell is responsible for detecting the object, the object and the prediction parameters of the bounding box are annotated, and object detection, recognition and posture estimation training is performed according to the annotated data;
the method comprises the following steps of carrying out object detection recognition and posture estimation on an input plane image by utilizing a trained detection model file, and optimizing the estimated object posture, and specifically comprises the following steps: carrying out object detection and identification on the plane image by using the trained detection model file, and estimating the initial posture of the identified object; the estimated object attitude is adjusted and optimized through an error energy functionThe error energy function is: e ═ pi (o)i,gcam,obj)-xi,|2Wherein, pi (o)i,gcam,obj) As a projection function of the edge sampling points of the object, oiAs edge sampling points, gcam,objFor projective transformation, x, of an object coordinate system to a camera plane coordinate systemi’Representing the nearest edge point coordinate from the point projection point i' on the planar image.
2. The method for estimating the posture of a weakly-textured three-dimensional object according to claim 1, wherein the tracked semi-automatic semi-manual labeling is specifically as follows:
tracking the three-dimensional object model in real time, and performing semi-automatic semi-manual labeling by recording the tracked image and the object posture in real time; alternatively, the first and second electrodes may be,
firstly, aligning the initial posture of a three-dimensional object model, then tracking the three-dimensional object model in real time through pose optimization, outputting the six-degree-of-freedom posture of the object in the tracking process, and carrying out semi-automatic semi-manual labeling;
wherein labeling refers to the process of labeling an image or data.
3. The method for estimating the pose of a weakly-textured three-dimensional object according to claim 1, wherein the deep neural network training specifically comprises:
the method comprises the steps that full-automatic data obtained by rendering are used as input samples to pre-train a deep neural network model, in the initial stage, along with the increase of training samples, the detection rate of the model is improved obviously, then along with the increase of the number of samples, the detection rate is improved and gradually slowed down, and when the number of the input samples is increased and the actual detection rate is improved and is lower than a set threshold value, the deep neural network model is further trained by adopting marking data of full-manual marking and/or semi-automatic semi-manual marking.
4. The method of weak texture three-dimensional object pose estimation according to claim 3, further comprising:
in the training process, calculating the loss of the object detection model, wherein the loss function is as follows:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
wherein λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement;
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i)), wherein Prob(obj_i) represents the confidence probability of object i;
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
wherein Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i;
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
wherein 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i represents the x coordinate of the center point of the prediction frame of object i, y_i represents the y coordinate of the center point of the prediction frame of object i, and x_i* and y_i* represent the true values of the x and y coordinates.
5. An apparatus for estimating the pose of a weakly textured three-dimensional object, comprising:
a three-dimensional object model rendering module, used for rendering the three-dimensional object model into a planar image according to the target background image, the three-dimensional object model and the corresponding posture; specifically used for rendering different postures of the three-dimensional object model into planar images according to the input three-dimensional object model file, the postures and the planar image background information, forming combined data of different posture viewing angles and different backgrounds; and for acquiring, when rendering different postures of the three-dimensional object model to a planar image, ground-truth data required for neural network training, wherein the ground-truth data comprise: fully automatic data obtained by the rendering, and semi-automatic semi-manual annotations obtained by tracking;
a model training module, used for performing deep neural network training; the model training module is further used for dividing the planar image into a plurality of grid cells; and predicting bounding boxes in each cell, wherein each bounding box corresponds to 5 prediction parameters: x, y, w, h and a corresponding object confidence probability, wherein x and y represent the coordinates of the center point of the cell, w and h are the width and height predicted relative to the whole image, and the confidence probability reflects whether an object is contained and how accurate its position is, the object id and posture orientation being determined through the confidence probabilities; if the center of an object falls within a bounding box, the cell is responsible for detecting the object, the object and the prediction parameters of the bounding box are annotated, and object detection, recognition and posture estimation training is performed according to the annotated data; and during the training process, calculating the loss of the object detection model;
a posture estimation module, used for performing object detection, recognition and posture estimation on the input planar image with the trained detection model file; and further used for performing object detection and recognition on the planar image with the trained detection model file and estimating the initial posture of the recognized object;
a pose optimization module, used for optimizing the estimated object posture; and further used for optimizing the estimated object posture through an error energy function, the error energy function being: E = |π(o_i, g_cam,obj) − x_i'|², wherein π(o_i, g_cam,obj) is the projection function of the edge sampling points of the object, o_i is an edge sampling point, g_cam,obj is the projective transformation from the object coordinate system to the camera plane coordinate system, and x_i' represents the coordinate of the edge point on the planar image nearest to the projected point i'.
6. The apparatus for estimating the posture of a weakly textured three-dimensional object according to claim 5, wherein the model training module is specifically configured to pre-train the deep neural network model with fully automatic rendered data as input samples, wherein in the initial stage the detection rate of the model improves markedly as training samples increase, and the improvement then slows as the number of samples grows; when the gain in the actual detection rate from adding input samples falls below a set threshold, the deep neural network model is further trained with annotation data from fully manual annotation and/or semi-automatic semi-manual annotation.
7. The apparatus for estimating pose of weak texture three-dimensional object according to claim 6, wherein the model training module further calculates loss of the object detection model during training, the loss function is:
Loss = λ_obj·Loss_obj + λ_width·Loss_width + λ_shift·Loss_shift
wherein λ_obj, λ_width and λ_shift are adjustment coefficients: λ_obj weights target object detection, λ_width weights the detection frame width, and λ_shift weights the detection frame displacement;
Loss_obj is the object classification loss, with the specific formula:
Loss_obj = Σ_i Prob(obj_i)·log(Prob(obj_i)), wherein Prob(obj_i) represents the confidence probability of object i;
Loss_width is the object bounding-box loss, with the specific formula:
Loss_width = Σ_i [(Wx_i − Wx_i*)² + (Wy_i − Wy_i*)²]
wherein Wx_i represents the predicted box width of object i, Wy_i represents the predicted box height of object i, and Wx_i* and Wy_i* represent the true values of the predicted box width and height of object i;
Loss_shift is the object center-offset loss, with the specific formula:
Loss_shift = Σ_i 1_i·[(x_i − x_i*)² + (y_i − y_i*)²]
wherein 1_i is 1 when the classification function returns the corresponding object id and 0 otherwise; x_i represents the x coordinate of the center point of the prediction frame of object i, y_i represents the y coordinate of the center point of the prediction frame of object i, and x_i* and y_i* represent the true values of the x and y coordinates.
8. An apparatus for weak texture three-dimensional object pose estimation, the apparatus comprising a memory and a processor, wherein:
the memory is used for storing codes and documents;
the processor for executing the code and documents stored in the memory for implementing the method steps of any of claims 1 to 4.
CN201910168783.7A 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object Active CN109934847B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910168783.7A CN109934847B (en) 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object

Publications (2)

Publication Number Publication Date
CN109934847A CN109934847A (en) 2019-06-25
CN109934847B true CN109934847B (en) 2020-05-22

Family

ID=66986438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910168783.7A Active CN109934847B (en) 2019-03-06 2019-03-06 Method and device for estimating posture of weak texture three-dimensional object

Country Status (1)

Country Link
CN (1) CN109934847B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276804B (en) * 2019-06-29 2024-01-02 深圳市商汤科技有限公司 Data processing method and device
CN110728222B (en) * 2019-09-30 2022-03-25 清华大学深圳国际研究生院 Pose estimation method for target object in mechanical arm grabbing system
CN110928414A (en) * 2019-11-22 2020-03-27 上海交通大学 Three-dimensional virtual-real fusion experimental system
CN111510701A (en) * 2020-04-22 2020-08-07 Oppo广东移动通信有限公司 Virtual content display method and device, electronic equipment and computer readable medium
CN113643433A (en) * 2020-04-27 2021-11-12 成都术通科技有限公司 Form and attitude estimation method, device, equipment and storage medium
CN111652901B (en) * 2020-06-02 2021-03-26 山东大学 Texture-free three-dimensional object tracking method based on confidence coefficient and feature fusion
CN113822102B (en) * 2020-06-19 2024-02-20 北京达佳互联信息技术有限公司 Gesture estimation method and device, electronic equipment and storage medium
CN112381879A (en) * 2020-11-16 2021-02-19 华南理工大学 Object posture estimation method, system and medium based on image and three-dimensional model
CN113094016B (en) * 2021-06-09 2021-09-07 上海影创信息科技有限公司 System, method and medium for information gain and display
CN113780291A (en) * 2021-08-25 2021-12-10 北京达佳互联信息技术有限公司 Image processing method and device, electronic equipment and storage medium

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003134B1 (en) * 1999-03-08 2006-02-21 Vulcan Patents Llc Three dimensional object pose estimation which employs dense depth information
US7706603B2 (en) * 2005-04-19 2010-04-27 Siemens Corporation Fast object detection for augmented reality systems
US8351646B2 (en) * 2006-12-21 2013-01-08 Honda Motor Co., Ltd. Human pose estimation and tracking using label assignment
CN101499128B (en) * 2008-01-30 2011-06-29 中国科学院自动化研究所 Three-dimensional human face action detecting and tracing method based on video stream
CN102270345A (en) * 2011-06-02 2011-12-07 西安电子科技大学 Image feature representing and human motion tracking method based on second-generation strip wave transform
CN103116902A (en) * 2011-11-16 2013-05-22 华为软件技术有限公司 Three-dimensional virtual human head image generation method, and method and device of human head image motion tracking
US8855366B2 (en) * 2011-11-29 2014-10-07 Qualcomm Incorporated Tracking three-dimensional objects
CN102609680B (en) * 2011-12-22 2013-12-04 中国科学院自动化研究所 Method for detecting human body parts by performing parallel statistical learning based on three-dimensional depth image information
CN104376600B (en) * 2014-11-25 2018-04-17 四川大学 Stabilization threedimensional model tracking based on online management super-resolution block
WO2017156243A1 (en) * 2016-03-11 2017-09-14 Siemens Aktiengesellschaft Deep-learning based feature mining for 2.5d sensing image search
US10373369B2 (en) * 2017-03-16 2019-08-06 Qualcomm Technologies, Inc. Three-dimensional pose estimation of symmetrical objects

Also Published As

Publication number Publication date
CN109934847A (en) 2019-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant