CN113240736A - Pose estimation method and device based on YOLO6D improved network - Google Patents

Pose estimation method and device based on YOLO6D improved network

Info

Publication number
CN113240736A
Authority
CN
China
Prior art keywords: yolo6d, network, pose estimation, image, original
Prior art date
Legal status: Pending
Application number
CN202110202464.0A
Other languages
Chinese (zh)
Inventor
钟志强
陈新度
吴磊
刘跃生
Current Assignee
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202110202464.0A priority Critical patent/CN113240736A/en
Publication of CN113240736A publication Critical patent/CN113240736A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

Disclosed are a pose estimation method, device and system based on a YOLO6D improved network. The YOLO6D improved network is built by modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer into 3 convolutional layers, and replacing the 5 maximum pooling layers of the original network with 4 maximum pooling layers and 1 global average pooling layer. The method acquires multiple groups of 2D images and 3D models of the target object that have a 2D-3D correspondence, inputs the 2D images and 3D models into the YOLO6D improved network, predicts the 1 central point and 8 corner points of the projection of the 3D model's bounding box on the 2D image, estimates the pose of the target object with a PnP (Perspective-n-Point) pose estimation algorithm, and outputs the pose estimation result that meets the evaluation index as the final pose estimation result. Compared with other algorithms, the method greatly improves running speed.

Description

Pose estimation method and device based on YOLO6D improved network
Technical Field
The disclosure relates to the technical field of intelligent control, in particular to a pose estimation method and device based on a YOLO6D improved network.
Background
6D object pose estimation has long been an important problem in computer vision, and a great deal of research has been devoted to it. Deep neural networks (DNNs) show excellent performance in real-time pose estimation, but to give DNNs good generalization ability the existing networks are very large and structurally complex, which leads to low computational efficiency and poor real-time performance and requires computers with strong computing power and ample memory. This is very unfriendly to scenarios where computing power is limited and only single-target pose estimation is needed.
The YOLO6D network framework is a modification of the YOLOv2 network framework. As shown in FIG. 2, the YOLOv2 network has 31 layers in total; layers 0-22 form the Darknet-19 network, which contains 19 convolutional layers and 5 maximum pooling layers. The detection network added on top begins at layer 23. Layers 23 and 24 are convolutional layers and layer 25 is a fusion layer whose function is to merge layers; a passthrough (direct connection) layer is added at layer 27 to obtain fine-grained 26×26 features, the 26×26×512 feature map is reshaped to 13×13×2048 and concatenated with the original 13×13×1024 feature map, so that multi-scale information is obtained and mAP is improved by 1% compared with YOLOv1. The output of layer 30 is 13×13 and each grid cell outputs 125 parameters, so the final output contains 13×13×125 parameters in total. The number of output channels of the last convolutional layer of YOLOv2 is 5 (number of anchors) × [4 (center_x, center_y, width, height) + 1 (confidence) + num_classes], where anchor denotes an anchor box, center_x, center_y, width and height denote the centre coordinates, width and height of the anchor box, confidence denotes the confidence, and num_classes denotes the number of object classes. YOLO6D changes the network output to 5 (number of anchors) × [18 (coordinates of the 9 vertices in the pixel coordinate system) + 1 (confidence) + num_classes].
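As a quick check of the output sizes described above, the sketch below recomputes the channel counts of the two heads; the 20-class VOC setting for YOLOv2 and the single-class setting for YOLO6D are assumptions made only for this illustration.

```python
# Hypothetical check of the output-tensor sizes discussed above.
num_anchors = 5

# YOLOv2 head: 4 box terms + 1 confidence + class scores per anchor.
num_classes_voc = 20                                          # assumed (VOC setting)
yolov2_channels = num_anchors * (4 + 1 + num_classes_voc)     # = 125
print(13 * 13 * yolov2_channels)                              # 13 x 13 x 125 output parameters

# YOLO6D head: 18 coordinates (9 control points x 2) + 1 confidence + classes.
num_classes_6d = 1                                            # assumed single-object case
yolo6d_channels = num_anchors * (18 + 1 + num_classes_6d)     # = 100
print(13 * 13 * yolo6d_channels)
```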
In a convolutional neural network, the convolutional layers extract increasingly deep feature information: the more convolutional layers, the finer the extracted features. However, each added convolutional layer increases the number of neurons and therefore the number of network parameters, making the model more complex; the more complex the model, the larger its computational cost and the more prone it is to overfitting.
The main functions of the pooling layer are: first, to remove redundant information and save computing resources; second, to retain the feature information of the detected object; third, to reduce the number of parameters, improve model performance and prevent overfitting. According to the relevant theory, the error of feature extraction mainly comes from two aspects: (1) the limited size of the neighbourhood increases the variance of the estimated values; (2) convolutional layer parameter errors cause a shift in the estimated mean. Average pooling reduces the first kind of error and retains more background information of the image. Both maximum pooling and average pooling downsample the data, but maximum pooling tends to reduce the second kind of error, retains more texture information, selects the features with the best discriminative power and provides non-linearity. This is similar to non-maximum suppression: it suppresses noise on the one hand and enhances the saliency of the feature map within the region on the other. Average pooling emphasises downsampling of the overall feature information, gives priority to reducing the number of parameters, and, while reducing dimensionality, is more favourable to passing the information on to the next feature-extraction module.
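A minimal NumPy illustration of the contrast described above is given below; the 4×4 input and the 2×2 non-overlapping window are chosen purely for the example.

```python
import numpy as np

x = np.array([[1., 9., 2., 1.],
              [0., 3., 0., 2.],
              [4., 4., 8., 0.],
              [4., 4., 1., 1.]])

def pool(a, fn, k=2):
    # apply fn to each non-overlapping k x k window
    h, w = a.shape
    return np.array([[fn(a[i:i + k, j:j + k]) for j in range(0, w, k)]
                     for i in range(0, h, k)])

print(pool(x, np.max))   # max pooling keeps the strongest response (texture, edges)
print(pool(x, np.mean))  # average pooling keeps the overall level (background)
```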
The network structure of YOLO6D is based on YOLOv2, and YOLO9000, another version of YOLOv2, can detect more than 9000 object classes; for this reason the YOLOv2 network structure is very large and complex.
Disclosure of Invention
The purpose of the present disclosure is to solve at least one of the deficiencies of the prior art, and to provide a pose estimation method and apparatus based on YOLO6D improved network.
In order to achieve the above object, the present disclosure proposes a pose estimation method based on a YOLO6D improved network, which includes the following,
a YOLO6D improved network construction process:
modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer;
a pose estimation process:
acquiring a plurality of groups of 2D images and 3D models of the target object with 2D-3D corresponding relation,
inputting the 2D image and a 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
performing pose estimation on the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image through a PnP pose estimation algorithm,
and outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
Further, the method for building the YOLO6D improved network further comprises the steps of,
adding 1 × 1 convolution kernels between the 3 × 3 convolution kernels in the original YOLO6D network to double the number of channels after each maximum pooling operation.
Further, the method further comprises performing a batch normalization operation on the input image data before each layer of the YOLO6D improved network.
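A minimal PyTorch-style sketch combining the two modifications above (1×1 kernels between 3×3 kernels, with the channel count doubled after each maximum pooling, and batch normalization before each layer's non-linearity) is given below; the layer sizes and the LeakyReLU activation are illustrative assumptions, not the exact configuration of the improved network.

```python
import torch.nn as nn

def conv_bn(in_ch, out_ch, k):
    # convolution followed by batch normalization and a non-linearity
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.1, inplace=True),
    )

def darknet_block(in_ch):
    out_ch = in_ch * 2              # channels double after the preceding max pooling
    return nn.Sequential(
        nn.MaxPool2d(2, 2),
        conv_bn(in_ch, out_ch, 3),
        conv_bn(out_ch, in_ch, 1),  # 1x1 kernel inserted between the 3x3 kernels
        conv_bn(in_ch, out_ch, 3),
    )
```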
Further, the method for predicting 1 center point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image specifically includes the following steps,
the input RGB image, namely the 2D image, has a size of 416×416 and is downsampled by a factor of 32 by the YOLO6D improved network, giving an output feature map of size 13×13; the image is thereby divided into a regular 2D grid of S×S cells. Each grid position in the output 3D tensor is associated with a multi-dimensional vector containing the predicted positions on the 2D image of the 9 control points (the 1 central point and the 8 corner points), the class probabilities of the object, and an overall confidence value. The cell with the highest confidence score is found as the central point, and the coordinate offsets of the other 8 corner points are expressed as follows,
g_x = f(x) + c_x, g_y = f(y) + c_y,
where c_x and c_y denote the coordinates of the grid cell associated with the central point; for the central point, f(·) is the sigmoid function, and for the corner points, f(·) is the identity function,
wherein the predicted confidence values of the 9 control points are calculated as follows,
c(x) = e^(α(1 - D_T(x)/D_th)) if D_T(x) < D_th, and c(x) = 0 otherwise,
where D_T(x) is defined as the 2D Euclidean distance in image space, c(·) is a sharp exponential function with a cut-off value, α is the sharpness of the exponential function, and D_th is a set threshold.
Further, the evaluation indexes of the YOLO6D improved network are evaluated for error by the following formulas,
e_REP = ||p_i - CHμ||_2,
where p_i is the position of pixel i, μ is the average of the maximum blending weights of the pixel distribution, H is the estimated object pose, and C is the camera matrix;
e_TE = ||t - t′||_2, e_RE = arccos[(Tr(RR′^(-1)) - 1)/2],
where t and t′ are the predicted and true translation matrices respectively, R and R′ are the predicted and true rotation matrices respectively, and the rotation error e_RE is expressed as the angle about the rotation axis.
The invention also provides a pose estimation system based on the YOLO6D improved network, which comprises,
a YOLO6D improved network construction module,
for modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer, so as to form the YOLO6D improved network;
a pose estimation module,
for obtaining a plurality of groups of 2D images and 3D models of the target object that have a 2D-3D correspondence,
inputting the 2D image and a 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
performing pose estimation on the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image through a PnP pose estimation algorithm,
and outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
Further, the YOLO6D improved network building module further includes a first subunit, where the first subunit is configured to add 1 × 1 convolution kernel between 3 × 3 convolution kernels in the original YOLO6D network, so as to double the number of channels after each maximum pooling operation.
The invention also proposes a computer-readable storage medium in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the pose estimation method based on the YOLO6D improved network according to any one of claims 1 to 5.
The beneficial effects of the present disclosure are as follows: the improved YOLO6D network provided by the disclosure has a simple structure, and when it is used for pose recognition of a target object, the pose estimation method based on the YOLO6D improved network runs far faster than other methods; it is particularly suitable for, but not limited to, applications with a single target object and requirements on running speed.
Drawings
The foregoing and other features of the present disclosure will become more apparent from the following detailed description of embodiments taken in conjunction with the accompanying drawings, in which like reference characters designate the same or similar elements throughout the several views. The drawings described below are merely some examples of the present disclosure, and other drawings may be derived from them by those skilled in the art without inventive effort. In the drawings:
FIG. 1 is a flow chart illustrating a pose estimation method based on YOLO6D improved network of the present disclosure;
FIG. 2 is a schematic diagram of a YOLOv2 network structure of the pose estimation method based on YOLO6D improved network of the present disclosure;
FIG. 3 is a diagram of an improved YOLO6D network structure of the pose estimation method based on the improved network of YOLO6D of the present disclosure;
FIG. 4 is a diagram showing the pose estimation effect of the pose estimation method based on the YOLO6D improved network in experimental demonstration;
fig. 5 is a schematic functional relationship diagram of a confidence value calculation formula of the pose estimation method based on the YOLO6D improved network according to the present disclosure.
Detailed Description
The conception, specific structure, and technical effects of the present disclosure will be described in detail below with reference to the accompanying drawings and embodiments, so that the purpose, scheme, and effects of the present disclosure can be fully understood. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The same reference numbers will be used throughout the drawings to refer to the same or like parts.
Referring to fig. 1 and 3, the present disclosure proposes a pose estimation method based on YOLO6D improved network, including the following,
a YOLO6D improved network construction process:
modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer;
a pose estimation process:
step 110, obtaining a plurality of sets of 2D images and 3D models of the target object with 2D-3D correspondence,
step 120, inputting the 2D image and the 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
step 130, estimating the pose of the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image by a PnP pose estimation algorithm,
and 140, outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
As a preferred embodiment of the present invention, the process of establishing the YOLO6D improved network in the method further includes,
adding 1×1 convolution kernels between the 3×3 convolution kernels in the original YOLO6D network, with the number of channels doubled after each maximum pooling operation. A batch normalization operation is performed on the input image data before each layer of the network, which effectively improves accuracy (mAP), noticeably improves convergence, and helps prevent overfitting.
As a preferred embodiment of the present invention, the method further comprises performing a batch normalization operation on the input image data before each layer of the YOLO6D improved network.
As a preferred embodiment of the present invention, the method for predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image specifically includes the following steps,
the input RGB image, namely the 2D image, has a size of 416×416 and is downsampled by a factor of 32 by the YOLO6D improved network, giving an output feature map of size 13×13; the image is thereby divided into a regular 2D grid of S×S cells. Each grid position in the output 3D tensor is associated with a multi-dimensional vector containing the predicted positions on the 2D image of the 9 control points (the 1 central point and the 8 corner points), the class probabilities of the object, and an overall confidence value. The cell with the highest confidence score is found as the central point, and the coordinate offsets of the other 8 corner points are expressed as follows,
g_x = f(x) + c_x, g_y = f(y) + c_y,
where c_x and c_y denote the coordinates of the grid cell associated with the central point; for the central point, f(·) is the sigmoid function, and for the corner points, f(·) is the identity function.
The network only needs to be called once when estimating the 6D object pose, which ensures fast operation. Each cell can predict the pose of objects within it, and prediction units with low confidence are removed by adjusting the confidence threshold. For larger targets, and for objects whose projection lies at the intersection of two cells, several cells are likely to predict a high confidence. To obtain a more robust pose estimate, we find the cells of the 3×3 neighbourhood with the highest confidence scores and combine the corner predictions of these neighbouring cells by computing a weighted average of the individual detections, using the confidence scores of the relevant cells as weights.
The network outputs the 2D projections of the corners of the object's 3D bounding box and of its centroid, together with the object identity. We estimate the 6D pose from the correspondences between these 2D points and the 3D points using the PnP pose estimation method. The PnP algorithm uses the 9 points, comprising the 8 corner points and the central point, whose correspondences with the known object determine the rotation matrix and translation matrix in camera coordinates. Instead of predicting the coordinate values directly, however, the network first predicts offsets from the grid cell. The central point and the corner points are handled differently: the cell in which the central point falls is responsible for predicting the object, so the offset of the central point must fall within that cell, and the cell output is therefore compressed to 0-1 by the sigmoid activation function; the other 8 corner points may fall outside the cell. The coordinate offsets of the 8 corner points can thus be expressed as,
g_x = f(x) + c_x, g_y = f(y) + c_y.
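A minimal sketch of this decoding step is given below; the 13×13 grid and the 416×416 input size are taken from the description above, while the (9, 2) tensor layout and the function and variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def decode_control_points(pred_xy, cell_x, cell_y, grid_size=13, img_size=416):
    """pred_xy: (9, 2) float array of raw offsets for the cell at (cell_x, cell_y);
    row 0 is the centre point, rows 1-8 are the corner points."""
    pts = np.empty_like(pred_xy)
    # centre point: sigmoid keeps the offset inside the responsible cell
    pts[0] = sigmoid(pred_xy[0]) + (cell_x, cell_y)
    # corner points: identity, offsets may fall outside the cell
    pts[1:] = pred_xy[1:] + (cell_x, cell_y)
    # convert grid coordinates to pixel coordinates
    return pts * (img_size / grid_size)
```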
the approximate location of the target is found by a minimization formula, then refined to the vertex location,
Figure BDA0002948347080000071
wherein
Figure BDA0002948347080000072
In order to be a loss of coordinates,
Figure BDA0002948347080000073
in order to be a loss of confidence,
Figure BDA0002948347080000074
is a classification loss. The coordinate loss and confidence loss are expressed by mean square error function, and the classification loss is expressed by cross entropy function. In order to improve the stability of the model, the weight of the non-target-included object is set to 0.1, the weight of the target-included object is set to 5, and the weights of the classification loss function and the coordinate loss function are both set to 1.
In 2D image detection there is a bounding box that encloses the object in 2D, and the goal is to detect that 2D bounding box in the image and classify it. Similarly, in 3D object detection there is a 3D bounding box in space that encloses the object, and the goal is to detect and classify this 3D bounding box. The 3D bounding box represents the pose of an object: it contains the position (x, y, z) of the object in 3D space and the rotation angles of the object around the x, y and z axes. These 6 quantities are also called the 6 degrees of freedom of the object, and once the 6 degrees of freedom of any object in space are known, its pose is uniquely determined. Before predicting the 6D pose, we first predict the 1 central point and 8 corner points of the 3D bounding box projected on the 2D image; these 9 control points are defined as the central point and the bounding-box corner points of the 3D object model. The 6D pose is then calculated from these 9 points by the PnP algorithm, so the problem of predicting the 6D pose of an object is converted into the problem of predicting 9 coordinate points.
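A minimal OpenCV sketch of this final step is given below; it assumes the 9 predicted 2D points, the corresponding 9 3D control points of the object model (centre plus 8 bounding-box corners) and a calibrated camera matrix are already available, and the function and variable names are illustrative only.

```python
import numpy as np
import cv2

def estimate_pose(points_3d, points_2d, camera_matrix, dist_coeffs=None):
    """points_3d: (9, 3) model control points; points_2d: (9, 2) predicted pixels."""
    if dist_coeffs is None:
        dist_coeffs = np.zeros(5)
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float32),
        points_2d.astype(np.float32),
        camera_matrix.astype(np.float32),
        dist_coeffs.astype(np.float32),
        flags=cv2.SOLVEPNP_ITERATIVE,
    )
    R, _ = cv2.Rodrigues(rvec)   # rotation vector -> 3x3 rotation matrix
    return R, tvec               # 6D pose: rotation + translation
```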
The model takes an RGB image of size 416×416 as input and downsamples it by a factor of 32 using the fully convolutional structure shown in FIG. 2, giving an output feature map of size 13×13; the image is divided into a regular 2D grid of S×S cells, and each grid position in the output 3D tensor is associated with a multi-dimensional vector that contains the predicted positions of the 9 control points on the 2D image, the class probabilities of the object and an overall confidence value.
The trained network predicts not only accurate 2D positions but also a high confidence value in regions where the object exists and a low confidence value where it does not. When detecting 2D objects, the intersection-over-union (IoU) score between the predicted anchor box and the ground-truth 2D rectangle in the image is typically used as the confidence value. Our object, however, is 3D, and computing an equivalent IoU score for two arbitrary cuboids requires computing the 3D region of their intersection; this calculation is complex and slows down training. We therefore take a different approach and model the predicted confidence values with the confidence function shown in FIG. 5. The confidence function c(x) returns a confidence value for a predicted point x based on the distance of the predicted 2D point from the actual target 2D point. Referring to FIG. 5, the confidence values of the 9 predicted control points are calculated as follows,
c(x) = e^(α(1 - D_T(x)/D_th)) if D_T(x) < D_th, and c(x) = 0 otherwise,
where D_T(x) is defined as the 2D Euclidean distance in image space, c(·) is a sharp exponential function with a cut-off value, α is the sharpness of the exponential function, and D_th is a set threshold. In practice, we apply the confidence function to all control points, calculate the mean, and assign it as the confidence.
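A minimal sketch of this confidence computation is given below; the values α = 2 and D_th = 30 pixels are the hyper-parameters reported later in the experiments, and the function name is an illustrative assumption.

```python
import numpy as np

def point_confidence(pred_pts, gt_pts, alpha=2.0, d_th=30.0):
    """Mean confidence of the 9 predicted control points.

    pred_pts, gt_pts: arrays of shape (9, 2) in pixel coordinates.
    """
    d = np.linalg.norm(pred_pts - gt_pts, axis=1)            # 2D Euclidean distance D_T(x)
    c = np.where(d < d_th, np.exp(alpha * (1.0 - d / d_th)), 0.0)
    return c.mean()                                           # averaged over the control points
```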
As a preferred embodiment of the present invention, the evaluation indexes of the YOLO6D improved network are evaluated for error by the following formulas. The 2D reprojection error is the average distance between the 2D projections of the vertices of the object's 3D mesh under the estimated pose and under the true pose; the pose estimate is considered accurate when this error is less than 5 pixels. The relevant formula is as follows,
e_REP = ||p_i - CHμ||_2,
where p_i is the position of pixel i, μ is the average of the maximum blending weights of the pixel distribution, H is the estimated object pose, and C is the camera matrix.
The 5 cm 5° criterion means that the estimate is correct if the translation error is below 5 cm and the rotation error is below 5°. The relevant formulas are as follows,
e_TE = ||t - t′||_2, e_RE = arccos[(Tr(RR′^(-1)) - 1)/2],
where t and t′ are the predicted and true translation matrices respectively, R and R′ are the predicted and true rotation matrices respectively, and the rotation error e_RE is expressed as the angle about the rotation axis.
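A minimal sketch of these two evaluation criteria is given below, assuming the estimated and true poses are available as rotation matrices and translation vectors; the function names and the metre/degree units are illustrative assumptions.

```python
import numpy as np

def reprojection_error(pts_3d, R_est, t_est, R_gt, t_gt, K):
    """Mean 2D distance between the projections of the mesh vertices
    under the estimated pose and under the true pose."""
    def project(R, t):
        p = (K @ (R @ pts_3d.T + t.reshape(3, 1))).T
        return p[:, :2] / p[:, 2:3]
    return np.linalg.norm(project(R_est, t_est) - project(R_gt, t_gt), axis=1).mean()

def pose_errors(R_est, t_est, R_gt, t_gt):
    e_te = np.linalg.norm(t_est - t_gt)                              # translation error
    cos_angle = (np.trace(R_est @ np.linalg.inv(R_gt)) - 1.0) / 2.0
    e_re = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))      # rotation error (degrees)
    return e_te, e_re

# a pose counts as correct if e_REP < 5 pixels, or under the 5 cm 5 degree rule
# if e_te < 0.05 (metres, assumed unit) and e_re < 5 (degrees)
```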
The method was further analysed experimentally as follows.
An experimental data set was first made. A traditional LineMod-format data set is troublesome to produce: a three-dimensional model must be made first and the three-dimensional information of the target then obtained, and for an irregular target object the three-dimensional model is complicated to build and suffers from low precision. Here, the three-dimensional information data set is generated with a two-dimensional code recognition method, and the three-dimensional coordinate information of the target object is obtained by combining it with the size of the target object's minimum circumscribed rectangle, thereby avoiding the three-dimensional modelling step.
The data set acquisition platform comprises a Kinect 2.0 camera, a rotary table, a code disc, the target object, a tripod, etc. The data acquisition process is as follows. First, the camera is calibrated with a printed checkerboard to obtain its intrinsic and extrinsic parameters. A code disc containing at least one two-dimensional code is then generated and printed with the ArUco library in OpenCV, and the target object is placed at the middle of the code-disc plane. The turntable is started with its speed set to about one revolution per 60 seconds; while it rotates, the camera captures real-time video of the object on the code disc, and the acquisition angle of the camera is adjusted during the process so that data of the object from every angle are obtained.
At least one two-dimensional code on the code disc must remain unoccluded during acquisition; if several are unoccluded, one of them is selected as the basis of calculation. A world coordinate system is set with the centre of the selected two-dimensional code as the origin, and the rotation matrix and translation matrix of the world coordinate system relative to the camera coordinate system are calculated. The distance between the target object and the selected two-dimensional code and the external dimensions of the target object are measured, and the world coordinates of the 8 vertices of the target object's minimum circumscribed rectangle are calculated. Combining the world coordinates of the 8 vertices with the intrinsic and extrinsic parameters of the camera, the pixel coordinates of the 8 vertices are computed with the projectPoints function in OpenCV. The 8 vertices are connected into 6 planes; the grey value of pixels outside these planes is set to 0 and the value of pixels inside is set to 255, yielding the corresponding mask file. In addition, the grey value of pixels outside the planes is set to 0 while pixels inside keep their values, yielding a mask-show file; by inspecting the mask-show file one can judge whether the pixel coordinates of the 8 vertices are accurate, and pictures with excessive error can be deleted manually. After bad data are removed, the mask, the target object and the pixel coordinates are used to generate a LineMod-format data set.
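A minimal OpenCV sketch of the core of this annotation step is given below; it assumes known camera intrinsics, marker size and measured box corners in the marker-anchored world frame, uses the pre-4.7 ArUco interface (the ArUco API differs between OpenCV versions), and all names are illustrative.

```python
import numpy as np
import cv2

def annotate_frame(frame, K, dist, marker_len, box_corners_world):
    """box_corners_world: (8, 3) corners of the object's minimum bounding box,
    expressed in the world frame anchored at the chosen marker's centre."""
    aruco = cv2.aruco
    dictionary = aruco.getPredefinedDictionary(aruco.DICT_4X4_50)   # assumed dictionary
    corners, ids, _ = aruco.detectMarkers(frame, dictionary)
    if ids is None:
        return None
    # pose of the marker (world frame) relative to the camera
    rvecs, tvecs, _ = aruco.estimatePoseSingleMarkers(corners, marker_len, K, dist)
    rvec, tvec = rvecs[0], tvecs[0]
    # project the 8 box vertices into the image
    px, _ = cv2.projectPoints(box_corners_world.astype(np.float32), rvec, tvec, K, dist)
    px = px.reshape(-1, 2)
    # build the binary mask from the convex hull of the projected vertices
    mask = np.zeros(frame.shape[:2], np.uint8)
    hull = cv2.convexHull(px.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 255)
    return px, mask
```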
Approximately 1000 pictures of the target object were taken at various angles, of which 70% are used as the training set and 30% as the test set.
To improve the robustness of the network model and avoid overfitting, the images are augmented during the experiments. Each time a picture is read from the training set, the image is randomly flipped, rotated or jittered, or its saturation and brightness are changed exponentially by a factor of up to 1.5, or the image is randomly scaled and translated by up to 20% of the image size. The confidence sharpness α is set to 2, the distance threshold to 30 pixels, and the learning rate to 0.001, with the learning rate reduced to one tenth of its value every 100 epochs.
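A minimal sketch of part of this augmentation is given below; the HSV-space implementation of the saturation/brightness change and the specific probabilities are illustrative assumptions, and in practice the 2D control-point labels must be transformed together with the image.

```python
import random
import numpy as np
import cv2

def augment(img, jitter=0.2, factor=1.5):
    # random horizontal flip (the corresponding keypoint labels must be flipped too)
    if random.random() < 0.5:
        img = cv2.flip(img, 1)
    # scale saturation and value by a random factor in [1/factor, factor]
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= random.uniform(1.0 / factor, factor)
    hsv[..., 2] *= random.uniform(1.0 / factor, factor)
    img = cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
    # random translation of up to 20% of the image size (labels must be shifted as well)
    h, w = img.shape[:2]
    tx = random.uniform(-jitter, jitter) * w
    ty = random.uniform(-jitter, jitter) * h
    M = np.float32([[1, 0, tx], [0, 1, ty]])
    return cv2.warpAffine(img, M, (w, h))
```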
Referring to FIG. 4, the target objects have different colours and shapes, and they are tested from different viewpoints such as top view, head-on view and side view. The first, second and third columns of FIG. 4 show pose estimation results for the same target objects from different angles and under different lighting conditions. The dimensions of the Vita milk box are: length 0.063 m, width 0.041 m, height 0.105 m; the cola can has radius 0.033 m and height 0.115 m; the Wangzai milk box has length 0.048 m, width 0.032 m and height 0.087 m. The objects in the fourth column are cylinders of brass and red copper, each with radius 0.01 m and height 0.034 m; the Vita milk box is more than 25 times their volume. This column therefore shows pose estimation results for target objects of the same material but different colours, and also demonstrates the improved algorithm's ability to estimate the pose of small targets. The fifth column shows the pose estimation results for a plastic building block and a copper cylinder, with the coordinate pose displayed on the copper cylinder. The results show that the minimum circumscribed rectangle completely encloses the target object and that the final pose estimate is very close to the real pose; in particular, even when the data set contains images with different brightness and blur caused by illumination, the algorithm can still compute the target pose well.
The method was compared with several algorithms that are currently in wide use. In this data set, Rcu and Ycu denote the red-copper and yellow-copper (brass) cylinders respectively, each with radius 0.01 m and height 0.034 m; they are small, the Vita milk box being more than 25 times their volume. From the results in Tables 1-3 it can be seen that, compared with the current mainstream algorithms, the improved pose estimation algorithm recognises relatively large, texture-rich targets such as the Vita milk box and the cola can as accurately as the original network, and outperforms BB8 on both the 2D reprojection error and the 5 cm 5° metric. The accuracy of pose estimation for small targets such as Rcu and Ycu is slightly lower than that of the original network, but in running speed the method is far faster than the other algorithms: nearly 12 times the speed of the BB8 algorithm and 17 times that of the Brachmann algorithm, reaching 35 FPS, which is suitable for real-time processing.
Because the improved network has a simple structure, its pose estimation accuracy for small targets such as Rcu and Ycu is slightly lower than that of the original network; for the same reason it is far faster than the other algorithms, making it suitable for applications with a single target object and requirements on running speed.
The evaluation indexes are compared below, where Ours denotes the data for the present method.
TABLE 1 Reprojection error accuracy (%) (table provided as an image in the original; values not reproduced here)
TABLE 2 5 cm 5° accuracy (%) (table provided as an image in the original; values not reproduced here)
TABLE 3 Processing speed comparison (FPS) (table provided as an image in the original; values not reproduced here)
The invention also provides a pose estimation system based on the YOLO6D improved network, which comprises,
a YOLO6D improved network construction module,
for modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer, so as to form the YOLO6D improved network;
a pose estimation module,
for obtaining a plurality of groups of 2D images and 3D models of the target object that have a 2D-3D correspondence,
inputting the 2D image and a 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
performing pose estimation on the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image through a PnP pose estimation algorithm,
and outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
As a preferred embodiment of the present invention, the YOLO6D improved network building module further includes a first subunit, where the first subunit is configured to add 1 × 1 convolution kernel between 3 × 3 convolution kernels in the original YOLO6D network, so as to double the number of channels after each maximum pooling operation.
The invention also proposes a computer-readable storage medium in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the pose estimation method based on the YOLO6D improved network according to any one of claims 1 to 5.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing module, or each of the modules may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, all or part of the flow in the method according to the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and used to implement the steps of the above embodiments of the method when executed by a processor. Wherein the computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording medium, usb disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM), Random Access Memory (RAM), electrical carrier signal, telecommunications signal, and software distribution medium, etc. It should be noted that the computer readable medium may contain other components which are subject to appropriate increase or decrease according to the requirements of legislation and patent practice in the jurisdiction, for example, in some jurisdictions, the computer readable medium does not include electrical carrier signals and telecommunication signals according to legislation and patent practice.
While the present invention has been described in considerable detail with reference to certain illustrative embodiments, it is not intended to be limited to any such details or embodiments; it is to be construed as covering the full intended scope of the invention defined by the appended claims when interpreted in view of the prior art. Furthermore, the foregoing describes the invention in terms of embodiments foreseen by the inventor, and insubstantial modifications of the invention not presently foreseen may nonetheless represent equivalents thereof.
The above description is only a preferred embodiment of the present invention, and the invention is not limited to the above embodiment; any implementation that achieves the technical effects of the invention by the same means shall fall within its protection scope. Within that scope, the technical solution and/or the implementation of the invention may be modified and varied in various ways.

Claims (8)

1. A pose estimation method based on a YOLO6D improved network, characterized by comprising the following,
a YOLO6D improved network construction process:
modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer;
a pose estimation process:
acquiring a plurality of groups of 2D images and 3D models of the target object with 2D-3D corresponding relation,
inputting the 2D image and a 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
performing pose estimation on the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image through a PnP pose estimation algorithm,
and outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
2. The YOLO6D improved network-based pose estimation method according to claim 1, wherein the YOLO6D improved network establishment process in the method further comprises,
adding 1 × 1 convolution kernels between the 3 × 3 convolution kernels in the original YOLO6D network to double the number of channels after each maximum pooling operation.
3. A method for estimating pose of an improved network based on YOLO6D as claimed in claim 2, wherein the method further includes batch normalization of the input image data before each layer of the YOLO6D improved network.
4. The pose estimation method based on the YOLO6D improved network according to claim 1, wherein predicting the 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image specifically comprises the following,
the input RGB image, namely the 2D image, has a size of 416×416 and is downsampled by a factor of 32 by the YOLO6D improved network, the output feature map has a size of 13×13, and the image is divided into a regular 2D grid of S×S cells; each grid position in the output 3D tensor is associated with a multi-dimensional vector containing the predicted positions on the 2D image of the 9 control points, namely the 1 central point and the 8 corner points, the class probabilities of the object, and an overall confidence value; the cell with the highest confidence score is found as the central point, and the coordinate offsets of the other 8 corner points are expressed as follows,
g_x = f(x) + c_x, g_y = f(y) + c_y,
where c_x and c_y denote the coordinates of the grid cell associated with the central point; for the central point, f(·) is the sigmoid function, and for the corner points, f(·) is the identity function,
wherein the predicted confidence values of the 9 control points are calculated as follows,
c(x) = e^(α(1 - D_T(x)/D_th)) if D_T(x) < D_th, and c(x) = 0 otherwise,
where D_T(x) is defined as the 2D Euclidean distance in image space, c(·) is a sharp exponential function with a cut-off value, α is the sharpness of the exponential function, and D_th is a set threshold.
5. The pose estimation method based on the YOLO6D improved network according to claim 4, wherein the evaluation indexes of the YOLO6D improved network are evaluated for error by the following formulas,
e_REP = ||p_i - CHμ||_2,
where p_i is the position of pixel i, μ is the average of the maximum blending weights of the pixel distribution, H is the estimated object pose, and C is the camera matrix;
e_TE = ||t - t′||_2, e_RE = arccos[(Tr(RR′^(-1)) - 1)/2],
where t and t′ are the predicted and true translation matrices respectively, R and R′ are the predicted and true rotation matrices respectively, and the rotation error e_RE is expressed as the angle about the rotation axis.
6. A pose estimation system based on a YOLO6D improved network, characterized by comprising,
a YOLO6D improved network construction module,
for modifying the 5 convolutional layers of the fifth layer in the original YOLO6D network into 3 convolutional layers, modifying the 7 convolutional layers of the sixth layer in the original YOLO6D network into 3 convolutional layers, and replacing the 5 maximum pooling layers in the original YOLO6D network with 4 maximum pooling layers plus 1 global average pooling layer, so as to form the YOLO6D improved network;
a pose estimation module,
for obtaining a plurality of groups of 2D images and 3D models of the target object that have a 2D-3D correspondence,
inputting the 2D image and a 3D model into a YOLO6D improved network, predicting 1 central point and 8 corner points of the projection of the bounding box of the 3D model on the 2D image,
performing pose estimation on the target object according to 1 central point and 8 angular points of the projection of the bounding box of the 3D model on the 2D image through a PnP pose estimation algorithm,
and outputting the pose estimation result meeting the evaluation index to obtain a final pose estimation result.
7. The pose estimation system based on the YOLO6D improved network according to claim 6, wherein the YOLO6D improved network construction module further comprises a first subunit, the first subunit being configured to add 1×1 convolution kernels between the 3×3 convolution kernels in the original YOLO6D network, with the number of channels doubled after each maximum pooling operation.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the pose estimation method based on the YOLO6D improved network according to any one of claims 1 to 5.
CN202110202464.0A 2021-02-23 2021-02-23 Pose estimation method and device based on YOLO6D improved network Pending CN113240736A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110202464.0A CN113240736A (en) 2021-02-23 2021-02-23 Pose estimation method and device based on YOLO6D improved network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110202464.0A CN113240736A (en) 2021-02-23 2021-02-23 Pose estimation method and device based on YOLO6D improved network

Publications (1)

Publication Number Publication Date
CN113240736A true CN113240736A (en) 2021-08-10

Family

ID=77130137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110202464.0A Pending CN113240736A (en) 2021-02-23 2021-02-23 Pose estimation method and device based on YOLO6D improved network

Country Status (1)

Country Link
CN (1) CN113240736A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152345A (en) * 2023-04-19 2023-05-23 盐城数智科技有限公司 Real-time object 6D pose and distance estimation method for embedded system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322510A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of 6D position and orientation estimation method using profile information
CN110930454A (en) * 2019-11-01 2020-03-27 北京航空航天大学 Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN112085804A (en) * 2020-08-21 2020-12-15 东南大学 Object pose identification method based on neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110322510A (en) * 2019-06-27 2019-10-11 电子科技大学 A kind of 6D position and orientation estimation method using profile information
CN110930454A (en) * 2019-11-01 2020-03-27 北京航空航天大学 Six-degree-of-freedom pose estimation algorithm based on boundary box outer key point positioning
CN112085804A (en) * 2020-08-21 2020-12-15 东南大学 Object pose identification method based on neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BUGRA TEKIN ET AL: "Real-Time Seamless Single Shot 6D Object Pose Prediction", 《IEEE》 *
JIA KANG ET AL: "YOLO-6D+: Single Shot 6D Pose Estimation Using", 《IEEE》 *
包志强 et al.: "6D object pose estimation algorithm based on improved YOLOv2" (改进YOLOV2的6D目标姿态估计算法), 《计算机工程与应用》 (Computer Engineering and Applications) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116152345A (en) * 2023-04-19 2023-05-23 盐城数智科技有限公司 Real-time object 6D pose and distance estimation method for embedded system

Similar Documents

Publication Publication Date Title
CN111328396B (en) Pose estimation and model retrieval for objects in images
CN106716450B (en) Image-based feature detection using edge vectors
CN111243093B (en) Three-dimensional face grid generation method, device, equipment and storage medium
EP3576017A1 (en) Method, apparatus, and device for determining pose of object in image, and storage medium
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
JP5261501B2 (en) Permanent visual scene and object recognition
CN112489099B (en) Point cloud registration method and device, storage medium and electronic equipment
CN104537705B (en) Mobile platform three dimensional biological molecular display system and method based on augmented reality
CN110675487A (en) Three-dimensional face modeling and recognizing method and device based on multi-angle two-dimensional face
CN109711246B (en) Dynamic object recognition method, computer device and readable storage medium
CN109934873B (en) Method, device and equipment for acquiring marked image
CN106600613B (en) Improvement LBP infrared target detection method based on embedded gpu
JP2015075429A (en) Marker, evaluation method of marker, information processing apparatus, information processing method, and program
CN111583381A (en) Rendering method and device of game resource map and electronic equipment
CN112215861A (en) Football detection method and device, computer readable storage medium and robot
CN106780757B (en) Method for enhancing reality
CN114387346A (en) Image recognition and prediction model processing method, three-dimensional modeling method and device
US20240037788A1 (en) 3d pose estimation in robotics
CN113240736A (en) Pose estimation method and device based on YOLO6D improved network
CN113379815A (en) Three-dimensional reconstruction method and device based on RGB camera and laser sensor and server
CN113240656A (en) Visual positioning method and related device and equipment
CN116051808A (en) YOLOv 5-based lightweight part identification and positioning method
WO2021114775A1 (en) Object detection method, object detection device, terminal device, and medium
CN113570535A (en) Visual positioning method and related device and equipment
CN114511894A (en) System and method for acquiring pupil center coordinates

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210810