CN113724330B - Monocular camera object pose estimation method, system, equipment and storage medium - Google Patents


Info

Publication number
CN113724330B
CN113724330B (application CN202111025418.4A)
Authority
CN
China
Prior art keywords: image, specific, data, actual, dimensional
Prior art date
Legal status
Active
Application number
CN202111025418.4A
Other languages: Chinese (zh)
Other versions: CN113724330A (en)
Inventor
陈忠伟
石岩
王益亮
邓辉
李正昊
李华伟
赵越
Current Assignee
Shanghai Xiangong Intelligent Technology Co ltd
Original Assignee
Shanghai Xiangong Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Xiangong Intelligent Technology Co ltd filed Critical Shanghai Xiangong Intelligent Technology Co ltd
Priority to CN202111025418.4A priority Critical patent/CN113724330B/en
Publication of CN113724330A publication Critical patent/CN113724330A/en
Application granted granted Critical
Publication of CN113724330B publication Critical patent/CN113724330B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Abstract

The application relates to a keypoint-based monocular camera object pose estimation method, system, equipment and storage medium, comprising the steps of: obtaining actual size information of an actual object and acquiring an actual object image of the actual object with a monocular camera; importing the actual object image into a preset specific object detection model and generating two-dimensional image coordinate data for a specific number of key points; generating the specific number of three-dimensional coordinate data from the actual size information based on a virtual camera coordinate system of the monocular camera; and generating the current object pose information from the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on the PNP principle. Because the specific object detection model is trained in advance on standard object image data, no manual sample collection is needed, which solves the problems of insufficient sample collection and difficult image annotation; computing the object pose with the PNP principle achieves three-dimensional localization of the object and improves the efficiency of pose information acquisition.

Description

Monocular camera object pose estimation method, system, equipment and storage medium
Technical Field
The application relates to the technical field of visual positioning, in particular to a monocular camera object pose estimation method, system, equipment and storage medium based on key points.
Background
Pose estimation is a very important link in the field of computer vision. It has wide application in robot control and navigation based on poses estimated from vision sensors, in augmented reality, and so on.
The basis of pose estimation is to find corresponding point pairs between the real world and the image projection, and then to adopt a pose estimation method suited to the type of the point pairs. Of course, the same type of point pairs can also be handled by either algebraic or nonlinear optimization methods, such as the Direct Linear Transform (DLT) and Bundle Adjustment (BA). The prior art generally refers to the process of estimating pose from known point pairs as solving PnP (Perspective-n-Point).
In the prior art, line-laser or point-cloud equipment is mostly used for environment detection, which suffers from the defects of high equipment cost and greatly reduced effectiveness under occlusion.
For this reason, the prior art proposes a deep-learning-based method and device for estimating the pose of an object with a monocular camera (patent application publication number: CN 109816725A), wherein the method comprises: 1) generating a training set and a verification set from the projection of the obtained three-dimensional object image into two-dimensional space, the object coordinates corresponding to the projection, and the object's label file; 2) learning the training set with a cascade convolutional neural network model and iterating the hyper-parameters; 3) testing the trained cascade convolutional neural network model with a test set, and estimating the object pose with the trained model when its accuracy is not less than a first preset threshold.
However, a disadvantage of this prior art is that the sample preparation for learning is monotonous and differs too much from the actual environment, and the pose estimated by the deep learning network used in this method is only rough, requiring further optimization by the ICP (Iterative Closest Point) algorithm.
Disclosure of Invention
The main object of the present invention is to provide a method, a system, a device and a storage medium for keypoint-based monocular camera object pose estimation, so as to remedy the drawbacks of the prior art described in the background art.
In order to achieve the above object, according to a first aspect of the present invention, there is provided a monocular camera object pose estimation method based on key points, the method comprising:
step S100: acquiring actual size information of an actual object and acquiring an actual object image of the actual object based on a monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
Step S200: the actual object image is imported into a preset specific object detection model, and two-dimensional image coordinate data of a specific number of key points are generated, wherein the specific object detection model is generated by training according to standard object image data in advance;
step S300: generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
step S400: and generating current object pose information according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on a PNP principle.
Specifically, step S200: the actual object image is imported into a preset specific object detection model, and two-dimensional image coordinate data of a specific number of key points are generated, wherein the specific object detection model is generated in advance according to standard object image data training, and the method further comprises the following steps:
step S201: obtaining object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is an environment formed by combining a plurality of environment backgrounds, environment illumination and camera view angles;
Step S202: rendering the object model data image and generating a standard two-dimensional sample image;
Step S203: scaling the standard two-dimensional sample image to a specific scale size, and setting a training data set and a test data set according to a specific number scale;
step S204: training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training, and generating a specific object detection model after testing.
Specifically, step S200: importing the actual object image into a preset specific object detection model, and generating two-dimensional image coordinate data of a specific number of key points specifically comprises:
step S210: importing the actual object image into a preset specific object detection model, and scaling the actual object image to a size matched with the standard two-dimensional sample image;
step S220: and generating a specific number of two-dimensional image coordinate data according to the scaled actual object image.
Specifically, step S202: rendering the object model data image and generating a standard two-dimensional sample image, further comprising:
and presetting a specific number and key points at specific positions according to the object model data.
Specifically, the specific positions include the corner points and the center point of the object model, and the specific number is the sum of the numbers of corner points and center points of the object model.
In order to achieve the above object, according to a second aspect of the present invention, there is also provided a monocular camera object pose estimation system based on key points, the system comprising:
the information acquisition module is used for acquiring the actual size information of the actual object and acquiring an actual object image of the actual object based on the monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
The image importing module is used for importing the actual object image into a preset specific object detection model and generating two-dimensional image coordinate data of a specific number of key points, wherein the specific object detection model is generated by training according to standard object image data in advance;
The virtual camera module is used for generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
The pose generation module is used for generating the pose information of the current object according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on the PNP principle.
Specifically, the system further comprises:
The system comprises a refinement model module, a camera view angle module and a camera view angle module, wherein the refinement model module is used for acquiring object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is formed by combining a plurality of environment backgrounds, environment illumination and camera view angles;
the image rendering module is used for rendering the object model data image and generating a standard two-dimensional sample image;
the image scaling module is used for scaling the standard two-dimensional sample image to a specific scale and setting a training data set and a test data set according to a specific quantity scale;
The model training module is used for training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training is completed, and generating a specific object detection model after testing is completed.
Specifically, the system further comprises:
the real object module is used for importing the real object image into a preset specific object detection model and scaling the real object image to a size matched with the standard two-dimensional sample image;
And the specific number module is used for generating specific number of two-dimensional image coordinate data according to the scaled actual object image.
In order to achieve the above object, according to a third aspect of the present invention, there is also provided a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above-mentioned method for estimating object pose of a monocular camera based on key points when the processor executes the computer program.
In order to achieve the above object, according to a fourth aspect of the present invention, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps described in the above-described keypoint-based monocular camera object pose estimation method.
The invention has the technical effects that:
According to the keypoint-based monocular camera object pose estimation method of the invention, the actual size information of the actual object and an actual object image of the actual object acquired by the monocular camera are obtained in sequence, wherein camera internal reference data are obtained after calibration based on the actual size information; the actual object image is imported into a preset specific object detection model to generate two-dimensional image coordinate data for a specific number of key points, wherein the specific object detection model is trained in advance on standard object image data; a specific number of three-dimensional coordinate data are generated from the actual size information based on a virtual camera coordinate system of the monocular camera; and the current object pose information is generated from the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on the PNP principle. Because the specific object detection model is trained in advance on standard object image data, a large number of training samples are preset and no manual sample collection is needed, which solves the problems of insufficient sample collection and difficult picture annotation; computing the object pose with the PNP principle achieves three-dimensional localization of the object and improves the efficiency of pose information acquisition.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application. In the drawings:
FIG. 1 is a flow chart of a method for estimating object pose of a monocular camera based on key points in one embodiment;
FIG. 2 is a block diagram of a system for estimating object pose of a monocular camera based on keypoints in one embodiment;
FIG. 3 is an internal block diagram of a computer device in one embodiment;
FIG. 4 is a diagram of the transformation relationship between coordinate systems in a camera imaging system, in one embodiment.
FIG. 5 is an example of rendering an image of the object model data and generating a standard two-dimensional sample image in one embodiment;
FIG. 6 is an example of rendering the object model data image and generating a standard two-dimensional sample image in one embodiment.
Detailed Description
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, based on the embodiments of the invention, which are obtained without inventive effort by a person of ordinary skill in the art, shall fall within the scope of the invention.
It should be noted that, in the description and claims of the present invention and the above figures, the terms "step S100", "step S200", and the like are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion.
As shown in fig. 1, the method for estimating the object pose of the monocular camera based on the key points according to the present invention, in a preferred embodiment, includes the steps of:
step S100: acquiring actual size information of an actual object and acquiring an actual object image of the actual object based on a monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
specifically, the actual object image is an image acquired by a common RGB camera.
Further, after the actual size information of the actual object is obtained, camera internal reference data of the monocular camera can be obtained by a person skilled in the art according to calibration of the monocular camera.
In addition, the camera internal parameter data depends on the internal parameters of the monocular camera, and after the monocular camera is selected, the internal parameters corresponding to the monocular camera can be obtained.
Step S200: the actual object image is imported into a preset specific object detection model, and two-dimensional image coordinate data of a specific number of key points are generated, wherein the specific object detection model is generated by training according to standard object image data in advance;
Specifically, the specific object detection model is the object detection model CenterNet, generated after training in advance. The specific number is nine.
Further, the CenterNet framework is an anchor-free object detection model; CenterNet requires no NMS post-processing, which simplifies training.
CenterNet is compatible with a variety of backbone models, including ResNet, Hourglass and DLA. For an input image $I \in \mathbb{R}^{W \times H \times 3}$, the network generates a keypoint heatmap $\hat{Y} \in [0,1]^{\frac{W}{R} \times \frac{H}{R} \times C}$,
wherein W, H are the width and height of the image, and R is the output stride;
C represents the number of keypoint types, and in the object detection task it represents the number of target categories.
To accomplish the object detection task, the CenterNet network model optimizes several partial objectives, including the heatmap focal loss, the local offset loss at the center point, and the size loss of the target bounding box.
Further, the pixel-wise logistic-regression focal loss on the heatmap is as follows:
$$L_k = \frac{-1}{N} \sum_{xyc} \begin{cases} (1-\hat{Y}_{xyc})^{\alpha} \log(\hat{Y}_{xyc}), & \text{if } Y_{xyc} = 1 \\ (1-Y_{xyc})^{\beta} (\hat{Y}_{xyc})^{\alpha} \log(1-\hat{Y}_{xyc}), & \text{otherwise} \end{cases}$$
wherein $N$ is the number of keypoints in the image, $\hat{Y}_{xyc}$ is the target output through the activation function, and $Y_{xyc}$ is the ground truth obtained by splatting each keypoint with a Gaussian distribution, $Y_{xyc} = \exp\left(-\frac{(x-\tilde{p}_x)^2 + (y-\tilde{p}_y)^2}{2\sigma_p^2}\right)$.
For a true keypoint $p$ of class $c$, its down-sampled position is $\tilde{p} = \lfloor p/R \rfloor$, and $\alpha$, $\beta$ are hyper-parameters of the loss function.
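For illustration only, the Gaussian ground-truth heatmap and the focal loss above can be sketched in numpy as follows (a minimal sketch with hypothetical function names, not the patent's implementation; α and β default to the commonly used values 2 and 4):

```python
import numpy as np

def gaussian_heatmap(shape, center, sigma):
    """Ground-truth heatmap: splat one keypoint with a Gaussian kernel."""
    h, w = shape
    y, x = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))

def focal_loss(y_hat, y, alpha=2.0, beta=4.0, eps=1e-12):
    """Pixel-wise logistic-regression focal loss, normalized by the
    number of keypoints N (pixels where y == 1)."""
    pos = (y == 1.0)
    n = max(pos.sum(), 1)
    pos_term = ((1 - y_hat[pos]) ** alpha) * np.log(y_hat[pos] + eps)
    neg = ~pos
    neg_term = ((1 - y[neg]) ** beta) * (y_hat[neg] ** alpha) * np.log(1 - y_hat[neg] + eps)
    return -(pos_term.sum() + neg_term.sum()) / n
```

The heatmap equals 1 exactly at the keypoint and decays with distance, which is what the case split in $L_k$ relies on.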
The loss function for the center-point offset introduced by image down-sampling is as follows:
$$L_{off} = \frac{1}{N} \sum_{p} \left| \hat{O}_{\tilde{p}} - \left(\frac{p}{R} - \tilde{p}\right) \right|$$
wherein $\hat{O}_{\tilde{p}}$ is the predicted local offset.
The total network training target loss is:
$$L_{det} = L_k + \lambda_{size} L_{size} + \lambda_{off} L_{off}$$
wherein $L_{size}$ is the L1 loss on the predicted box size, and the adjustment coefficients $\lambda_{size}$, $\lambda_{off}$ are set by default to 0.1 and 1.
Object detection aims at detecting the object category and the bounding-box position in the image; the CenterNet network outputs a heatmap for each category, and the peak points must be extracted from it to obtain the center positions of the bounding boxes.
Each response point of the heatmap is compared with its 8 surrounding neighbors; a point is kept only if it is greater than or equal to all of them, and finally the first N peak points meeting this requirement are retained.
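The 8-neighbor peak extraction just described can be sketched as follows (a hypothetical helper for a single-channel heatmap; a point is a peak when it equals the maximum of its 3×3 neighborhood):

```python
import numpy as np

def extract_peaks(heatmap, top_n):
    """Keep points >= all 8 surrounding neighbors; return top_n by response."""
    h, w = heatmap.shape
    padded = np.pad(heatmap, 1, mode="constant", constant_values=-np.inf)
    # Maximum over the 3x3 neighborhood of each pixel (center included).
    neigh = np.max(np.stack([padded[dy:dy + h, dx:dx + w]
                             for dy in range(3) for dx in range(3)]), axis=0)
    is_peak = heatmap >= neigh  # equality holds exactly at a local maximum
    ys, xs = np.nonzero(is_peak)
    order = np.argsort(heatmap[ys, xs])[::-1][:top_n]
    return list(zip(ys[order], xs[order]))
```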
The resulting bounding box is:
$$\left(\hat{x}_i + \delta\hat{x}_i - \frac{\hat{w}_i}{2},\ \hat{y}_i + \delta\hat{y}_i - \frac{\hat{h}_i}{2},\ \hat{x}_i + \delta\hat{x}_i + \frac{\hat{w}_i}{2},\ \hat{y}_i + \delta\hat{y}_i + \frac{\hat{h}_i}{2}\right)$$
wherein $(\hat{x}_i, \hat{y}_i)$ is the $i$-th detected keypoint, $(\delta\hat{x}_i, \delta\hat{y}_i)$ is the local offset predicted at that point, and $\hat{w}_i$, $\hat{h}_i$ are the width and height of the bounding box predicted at that point.
It should be noted that CenterNet is a mature technology and the foregoing is merely an example whose principle those skilled in the art are familiar with; after the specific number of key points is set, the corresponding two-dimensional image coordinate data are generated.
Step S300: generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
Specifically, the formula converting a world coordinate point $(X_w, Y_w, Z_w)$ to image coordinates $(u, v)$ is as follows:
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \left[ R \mid t \right] \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
wherein $s$ is a scale coefficient, $K$ is the camera intrinsic matrix, and $[R \mid t]$ comprises the camera extrinsic parameters.
Under the condition that the parameters inside and outside the camera are known, the three-dimensional coordinate point corresponds to the unique two-dimensional image coordinate.
That is, conversion of three-dimensional coordinate data with two-dimensional image coordinate data and the actual size information can be achieved by the above formula.
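As a sketch of this conversion (assuming a known intrinsic matrix K and extrinsics R, t; the function name is hypothetical):

```python
import numpy as np

def project_points(K, R, t, pts_world):
    """Project Nx3 world points to Nx2 pixel coordinates
    via s * [u, v, 1]^T = K [R|t] [Xw, Yw, Zw, 1]^T."""
    pts = np.asarray(pts_world, dtype=float)
    cam = pts @ R.T + t              # world -> camera coordinates
    uvw = cam @ K.T                  # apply the intrinsic matrix
    return uvw[:, :2] / uvw[:, 2:3]  # divide out the scale coefficient s
```

For instance, with the identity rotation, t = (0, 0, 5) and a principal point of (320, 240), the world origin projects exactly onto the principal point.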
Step S400: and generating current object pose information according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on a PNP principle.
Specifically, the camera intrinsic data can be obtained once the monocular camera to be used has been determined and calibrated.
Further, in the camera imaging system, four coordinate systems are included in total: world coordinate system, camera coordinate system, image coordinate system and pixel coordinate system.
The world coordinate system, the camera coordinate system, the image coordinate system, and the pixel coordinate system are mutually convertible. The specific conversion relationship is shown in fig. 4.
In addition, the conversion between the above coordinate systems is prior art, and the present application is not specifically described.
On the other hand, the current object pose information can be calculated by a person skilled in the art using a solver function of the OpenCV algorithm library; the present application does not describe this in detail.
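As an illustration of the same principle where OpenCV is not used, the DLT method mentioned in the background can recover the 3×4 projection matrix (up to scale) from at least six non-coplanar 3D-2D correspondences. This is a hedged sketch with hypothetical names, not the patent's solver:

```python
import numpy as np

def dlt_projection(pts3d, pts2d):
    """Estimate the 3x4 projection matrix P (up to scale) from >= 6
    3D-2D correspondences by direct linear transformation (DLT)."""
    rows = []
    for (X, Y, Z), (u, v) in zip(pts3d, pts2d):
        rows.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        rows.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return vt[-1].reshape(3, 4)  # null-space vector reshaped to P

def reproject(P, pts3d):
    """Apply P to homogeneous 3D points and dehomogenize."""
    homo = np.hstack([np.asarray(pts3d, dtype=float), np.ones((len(pts3d), 1))])
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]
```

A practical system would instead call a PnP solver directly, but the sketch shows why known 3D-2D point pairs plus intrinsics suffice to determine the pose.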
In one embodiment, step S200: the actual object image is imported into a preset specific object detection model, and two-dimensional image coordinate data of a specific number of key points are generated, wherein the specific object detection model is generated in advance according to standard object image data training, and the method further comprises the following steps:
step S201: obtaining object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is an environment formed by combining a plurality of environment backgrounds, environment illumination and camera view angles;
Specifically, this step is carried out in Blender. To guarantee the diversity of the sample pictures, a number of standard objects are set up and combined with a variety of constructed environment scenes, and a suitable texture map and background picture are set for each standard object.
Step S202: rendering the object model data image and generating a standard two-dimensional sample image;
Step S203: scaling the standard two-dimensional sample image to a specific scale size, and setting a training data set and a test data set according to a specific number scale;
Specifically, the specific scale size is 640×384. The specific number scale is 8:2, i.e. the samples are divided into a training data set and a test data set at an 8:2 ratio.
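The 8:2 division can be sketched as follows (helper name hypothetical; the fixed seed is only for reproducibility of the shuffle):

```python
import random

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle the samples and divide them into training and
    test sets at the given ratio (8:2 by default)."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]
```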
Step S204: training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training, and generating a specific object detection model after testing.
In one embodiment, step S200: importing the actual object image into a preset specific object detection model, and generating two-dimensional image coordinate data of a specific number of key points specifically comprises:
step S210: importing the actual object image into a preset specific object detection model, and scaling the actual object image to a size matched with the standard two-dimensional sample image;
Specifically, scaling the actual object image to a size matching the standard two-dimensional sample image enables fast matching and data processing in the subsequent steps.
Step S220: and generating a specific number of two-dimensional image coordinate data according to the scaled actual object image.
In another embodiment, step S202: rendering the object model data image and generating a standard two-dimensional sample image, further comprising:
A specific number of key points at specific positions are preset according to the object model data. The specific number and the specific positions can be set according to the corner points of the actual object model and the unique center point of those corners. The specific positions include the corner points and the center point of the object model, and the specific number is the sum of their counts.
Specifically, as shown in fig. 5 to 6, the cuboid shelf in the example has a specific number of nine, that is, eight dispersed corner points and the center point of those eight corner points. This arrangement reduces the deviation of the PNP-computed pose as much as possible.
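The nine key points of such a cuboid can be generated from its actual size information as follows (a sketch assuming an object-centered coordinate frame; the function name is hypothetical):

```python
import numpy as np

def cuboid_keypoints(l, w, h):
    """Nine 3D key points of an l x w x h cuboid: the eight corner
    points plus their center, in an object-centered frame."""
    corners = np.array([[sx * l / 2, sy * w / 2, sz * h / 2]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
                       dtype=float)
    center = corners.mean(axis=0, keepdims=True)  # the origin (0, 0, 0)
    return np.vstack([corners, center])
```

These nine 3D points paired with the nine detected 2D keypoints form exactly the correspondences the PNP step consumes.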
In summary, the invention obtains in sequence the actual size information of the actual object and an actual object image of the actual object acquired by the monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information; imports the actual object image into a preset specific object detection model to generate two-dimensional image coordinate data for a specific number of key points, wherein the specific object detection model is trained in advance on standard object image data; generates a specific number of three-dimensional coordinate data from the actual size information based on a virtual camera coordinate system of the monocular camera; and generates the current object pose information from the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on the PNP principle. Because the specific object detection model is trained in advance on standard object image data, a large number of training samples are preset and no manual sample collection is needed, which solves the problems of insufficient sample collection and difficult picture annotation; computing the object pose with the PNP principle achieves three-dimensional localization of the object and improves the efficiency of pose information acquisition.
In one embodiment, as shown in fig. 2, a system for estimating the pose of an object by a monocular camera based on key points, the system comprising:
the information acquisition module is used for acquiring the actual size information of the actual object and acquiring an actual object image of the actual object based on the monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
The image importing module is used for importing the actual object image into a preset specific object detection model and generating two-dimensional image coordinate data of a specific number of key points, wherein the specific object detection model is generated by training according to standard object image data in advance;
The virtual camera module is used for generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
The pose generation module is used for generating the pose information of the current object according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on the PNP principle.
In one embodiment, the system further comprises:
The system comprises a refinement model module, a camera view angle module and a camera view angle module, wherein the refinement model module is used for acquiring object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is formed by combining a plurality of environment backgrounds, environment illumination and camera view angles;
the image rendering module is used for rendering the object model data image and generating a standard two-dimensional sample image;
the image scaling module is used for scaling the standard two-dimensional sample image to a specific scale and setting a training data set and a test data set according to a specific quantity scale;
The model training module is used for training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training is completed, and generating a specific object detection model after testing is completed.
In one embodiment, the system further comprises:
the real object module is used for importing the real object image into a preset specific object detection model and scaling the real object image to a size matched with the standard two-dimensional sample image;
And the specific number module is used for generating specific number of two-dimensional image coordinate data according to the scaled actual object image.
In one embodiment, as shown in fig. 3, a computer device includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the above keypoint-based monocular camera object pose estimation method when executing the computer program.
In one embodiment, a computer readable storage medium has a computer program stored thereon which, when executed by a processor, implements the steps of the above keypoint-based monocular camera object pose estimation method.
It should be further noted that those skilled in the art will understand that all or part of the procedures of the methods in the embodiments described above may be implemented by a computer program, which may be stored in a non-volatile computer readable storage medium and, when executed, may include the procedures of the embodiments of the methods described above.
Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any such combination should be considered within the scope of this description as long as it contains no contradiction.
The above examples illustrate only a few embodiments of the application; their description is specific and detailed but is not therefore to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within its scope of protection. Accordingly, the scope of protection of the present application shall be determined by the appended claims.

Claims (8)

1. A keypoint-based monocular camera object pose estimation method, characterized in that the method comprises:
step S100: acquiring actual size information of an actual object and acquiring an actual object image of the actual object based on a monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
step S201: obtaining object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is an environment formed by combining a plurality of environment backgrounds, environment illumination and camera view angles;
Step S202: rendering the object model data image and generating a standard two-dimensional sample image;
Step S203: scaling the standard two-dimensional sample image to a specific scale size, and setting a training data set and a test data set according to a specific quantity ratio;
step S204: training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training is completed, and generating a specific object detection model after testing is completed;
Step S200: the actual object image is imported into a preset specific object detection model, and two-dimensional image coordinate data of a specific number of key points are generated, wherein the specific object detection model is generated by training according to standard object image data in advance;
step S300: generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
step S400: and generating current object pose information according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on a PNP principle.
2. The keypoint-based monocular camera object pose estimation method according to claim 1, wherein step S200, importing the actual object image into a preset specific object detection model and generating two-dimensional image coordinate data of a specific number of key points, specifically comprises:
step S210: importing the actual object image into a preset specific object detection model, and scaling the actual object image to a size matched with the standard two-dimensional sample image;
step S220: and generating a specific number of two-dimensional image coordinate data according to the scaled actual object image.
3. The keypoint-based monocular camera object pose estimation method according to claim 1, wherein step S202, rendering the object model data image and generating a standard two-dimensional sample image, further comprises: presetting a specific number of key points at specific positions according to the object model data.
4. The keypoint-based monocular camera object pose estimation method of claim 3, wherein the specific positions comprise the corner points and the center points of the object model, and the specific number is the sum of the numbers of the corner points and the center points of the object model.
5. A system for estimating object pose of a monocular camera based on key points, the system comprising:
the information acquisition module is used for acquiring the actual size information of the actual object and acquiring an actual object image of the actual object based on the monocular camera, wherein camera internal reference data are obtained after calibration based on the actual size information;
The image importing module is used for importing the actual object image into a preset specific object detection model and generating two-dimensional image coordinate data of a specific number of key points, wherein the specific object detection model is generated by training according to standard object image data in advance;
The virtual camera module is used for generating a specific number of three-dimensional coordinate data according to the actual size information based on a virtual camera coordinate system of the monocular camera;
The pose generation module is used for generating current object pose information according to the three-dimensional coordinate data, the camera internal reference data and the two-dimensional image coordinate data based on a PNP principle;
The refinement model module is used for acquiring object model data of a preset standard object in a specific preset environment, wherein the specific preset environment comprises a plurality of refinement model environments, and each refinement model environment is formed by combining a plurality of environment backgrounds, environment illumination conditions and camera view angles;
the image rendering module is used for rendering the object model data image and generating a standard two-dimensional sample image;
the image scaling module is used for scaling the standard two-dimensional sample image to a specific scale size and setting a training data set and a test data set according to a specific quantity ratio;
The model training module is used for training a preset initial detection model based on the training data set, testing the trained initial detection model according to the testing data set after training is completed, and generating a specific object detection model after testing is completed.
6. The keypoint-based monocular camera object pose estimation system of claim 5, further comprising:
the actual object module is used for importing the actual object image into the preset specific object detection model and scaling the actual object image to a size matched with the standard two-dimensional sample image;
and the specific number module is used for generating a specific number of two-dimensional image coordinate data according to the scaled actual object image.
7. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 4 when the computer program is executed.
8. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 4.
CN202111025418.4A 2021-09-02 2021-09-02 Monocular camera object pose estimation method, system, equipment and storage medium Active CN113724330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111025418.4A CN113724330B (en) 2021-09-02 2021-09-02 Monocular camera object pose estimation method, system, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113724330A CN113724330A (en) 2021-11-30
CN113724330B true CN113724330B (en) 2024-04-30

Family

ID=78680891

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111025418.4A Active CN113724330B (en) 2021-09-02 2021-09-02 Monocular camera object pose estimation method, system, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113724330B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012422B (en) * 2023-03-23 2023-06-09 西湖大学 Monocular vision-based unmanned aerial vehicle 6D pose estimation tracking method and application thereof

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109816725A (en) * 2019-01-17 2019-05-28 哈工大机器人(合肥)国际创新研究院 A kind of monocular camera object pose estimation method and device based on deep learning
US10417781B1 (en) * 2016-12-30 2019-09-17 X Development Llc Automated data capture
CN111968235A (en) * 2020-07-08 2020-11-20 杭州易现先进科技有限公司 Object attitude estimation method, device and system and computer equipment
CN112652016A (en) * 2020-12-30 2021-04-13 北京百度网讯科技有限公司 Point cloud prediction model generation method, pose estimation method and device


Similar Documents

Publication Publication Date Title
CN110568447B (en) Visual positioning method, device and computer readable medium
CN109063301B (en) Single image indoor object attitude estimation method based on thermodynamic diagram
Pandey et al. Automatic targetless extrinsic calibration of a 3d lidar and camera by maximizing mutual information
CN110567441B (en) Particle filter-based positioning method, positioning device, mapping and positioning method
CN114143519B (en) Method and device for automatically matching projection image with curtain area and projector
CN111246098B (en) Robot photographing method and device, computer equipment and storage medium
CN112380926B (en) Weeding path planning system of field weeding robot
CN113689578A (en) Human body data set generation method and device
CN113724330B (en) Monocular camera object pose estimation method, system, equipment and storage medium
Streiff et al. 3D3L: Deep learned 3D keypoint detection and description for lidars
CN116630442B (en) Visual SLAM pose estimation precision evaluation method and device
CN117036756B (en) Remote sensing image matching method and system based on variation automatic encoder
CN116921932A (en) Welding track recognition method, device, equipment and storage medium
CN116128919A (en) Multi-temporal image abnormal target detection method and system based on polar constraint
US11699303B2 (en) System and method of acquiring coordinates of pupil center point
Wan et al. A performance comparison of feature detectors for planetary rover mapping and localization
JP2014178967A (en) Three-dimensional object recognition device and three-dimensional object recognition method
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium
AU2017300877B2 (en) Method and device for aiding the navigation of a vehicle
Kaess et al. MCMC-based multiview reconstruction of piecewise smooth subdivision curves with a variable number of control points
CN113657327B (en) Non-living body attack discrimination method, device, equipment and medium suitable for image
CN115965759B (en) Method for building digital modeling by using laser radar
US11734855B2 (en) Rotation equivariant orientation estimation for omnidirectional localization
CN116137039B (en) Visual and laser sensor external parameter correction method and related equipment
CN117495938B (en) Foldable hollow plate production data extraction method based on image processing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant