CN116228850A - Object posture estimation method, device, electronic equipment and readable storage medium
- Publication number: CN116228850A (application CN202111460674.6A)
- Authority: CN (China)
- Prior art keywords: image, key point, processed, information, determining
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/73—Determining position or orientation of objects or cameras using feature-based methods
- G06T7/74—Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
- G06T19/006—Mixed reality
- G06T3/00—Geometric image transformations in the plane of the image
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
- G06T7/33—Determination of transform parameters for the alignment of images, i.e. image registration, using feature-based methods
- G06V20/20—Scenes; Scene-specific elements in augmented reality scenes
Abstract
The embodiments of the present application provide an object posture estimation method, an object posture estimation device, an electronic device and a readable storage medium. The object posture estimation method includes: determining key point information in an image to be processed; determining corrected key point information based on the key point feature map corresponding to the key point information; and estimating the object posture in the image to be processed based on the corrected key point information. This ensures the accuracy of object posture estimation while avoiding the time consumed by image rendering, and can effectively improve the processing speed of AR applications. The object posture estimation method performed by the electronic device may also be performed using an artificial intelligence model.
Description
Technical Field
The present application relates to the field of computer vision, and in particular, to an object pose estimation method, an object pose estimation device, an electronic device, and a readable storage medium.
Background
Computer vision-based augmented reality (Augmented Reality, AR) technology provides the user with a realistic information experience by adding virtual content to the real scene in front of the user.
In three-dimensional space, an augmented reality system needs to process and understand the three-dimensional state of surrounding objects in real time and with high precision in order to present a high-quality virtual-real fusion effect to the user. Accurate and fast object pose estimation is therefore very important for AR interaction.
Existing pose estimation methods are too time-consuming to meet the requirements of AR scenarios with high speed demands.
Disclosure of Invention
An object of an embodiment of the present application is to solve the problem of how to increase the processing speed of object pose estimation.
According to an aspect of an embodiment of the present application, there is provided an object pose estimation method, including:
determining key point information in an image to be processed;
determining corrected key point information based on the key point feature map corresponding to the key point information;
and estimating the object posture in the image to be processed based on the corrected key point information.
According to an aspect of the embodiments of the present application, there is also provided an object pose estimation method, including:
converting the input image into an image to be processed of a predetermined image style;
determining key point information in an image to be processed;
based on the keypoint information, an object pose in the input image is estimated.
According to another aspect of the embodiments of the present application, there is provided an object pose estimation apparatus, including:
the determining module is used for determining key point information in the image to be processed;
the correction module is used for determining corrected key point information based on the key point feature map corresponding to the key point information;
and the estimation module is used for estimating the object posture in the image to be processed based on the corrected key point information.
According to another aspect of the embodiments of the present application, there is also provided an object pose estimation apparatus, including:
the style conversion module is used for converting the input image into an image to be processed with a preset image style;
the key point determining module is used for determining key point information in the image to be processed;
and the object posture estimation module is used for estimating the object posture in the input image based on the key point information.
According to still another aspect of the embodiments of the present application, there is provided an electronic device including: a memory, a processor and a computer program stored on the memory, the processor executing the computer program to implement the steps of the object pose estimation method provided by the embodiments of the present application.
According to still another aspect of the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the object pose estimation method provided by the embodiments of the present application.
According to yet another aspect of the embodiments of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the steps of the object pose estimation method provided by the embodiments of the present application.
According to the object posture estimation method, device, electronic device and readable storage medium provided by the embodiments of the present application, key point information in the image to be processed is determined, and corrected key point information is determined based on the key point feature map corresponding to the key point information. This ensures the accuracy of object posture estimation while avoiding the time consumed by image rendering, and can effectively improve the processing speed of AR applications.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that are required to be used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flow chart of an object posture estimation method according to an embodiment of the present application;
fig. 2 is a schematic diagram of an image style conversion method according to an embodiment of the present application;
fig. 3 is a schematic diagram of an object pose estimation refinement method according to an embodiment of the present application;
FIG. 4a is a schematic diagram of a refinement network according to an embodiment of the present application;
FIG. 4b is a schematic diagram illustrating visualization of corrected keypoint information according to an embodiment of the present application;
FIG. 5a is a schematic diagram of an iterative refinement method for pose estimation according to an embodiment of the present application;
FIG. 5b is a second schematic diagram of the visualization of corrected keypoint information according to the embodiment of the present application;
FIG. 6a is a schematic diagram of a complete object pose estimation according to an embodiment of the present application;
FIG. 6b is a schematic representation of a visualization of an object pose in an estimated processed image provided by an embodiment of the present application;
fig. 7 is a flowchart of another object posture estimation method according to an embodiment of the present application
Fig. 8 is a schematic structural diagram of an object posture estimation device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of another object posture estimation device according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the drawings in the present application. It should be understood that the embodiments described below with reference to the drawings are exemplary descriptions for explaining the technical solutions of the embodiments of the present application, and the technical solutions of the embodiments of the present application are not limited.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and "comprising," when used in this application, specify the presence of stated features, information, data, steps, operations, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof, etc. that may be implemented as desired in the art. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates that at least one of the items defined by the term, e.g., "a and/or B" may be implemented as "a", or as "B", or as "a and B".
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Object pose estimation methods generally consist of two main stages: the first stage produces an initial pose estimate in various ways; the second stage performs refinement, for example using a render-and-compare method to refine the pose from the previous stage. However, because image rendering is involved, this refinement process consumes a lot of computing resources and computing time, causes long delays in AR application scenarios, and cannot provide a good user experience.
To address the above technical problems or aspects needing improvement, the embodiments of the present application provide an object pose estimation method, an object pose estimation device, an electronic device and a readable storage medium for accurate and fast augmented reality interaction.
The technical solutions of the embodiments of the present application and technical effects produced by the technical solutions of the present application are described below by describing several exemplary embodiments. It should be noted that the following embodiments may be referred to, or combined with each other, and the description will not be repeated for the same terms, similar features, similar implementation steps, and the like in different embodiments.
An embodiment of the present application provides an object posture estimation method, as shown in fig. 1, including:
step S101: determining key point information in an image to be processed;
for the embodiment of the present application, the image to be processed may be an input image of a real scene that is to undergo object posture estimation (for example, an image captured by a sensor), or may be an image obtained by applying specific processing to such an input image before object posture estimation.
The number of the objects in the image to be processed may be one or more, and embodiments of the present application are not limited herein.
The key points referred to in the embodiments of the present application may be understood as points having semantic features (i.e., having representative characteristics) for the object, and the key points in each image to be processed may also be one or more, for example, when the object is a table, the key points may be, but are not limited to, a table corner, a table edge, a pillar, a table surface texture, or a pattern, and the like.
Optionally, the key point information in the image to be processed may be extracted in real time, or the extracted key point information in the image to be processed may be obtained, or the key point information in any processing process may be determined. The embodiment of the application is not limited to the determination method of the key point information in the image to be processed. The embodiment of the present application is not limited to the extraction method of the key point information in the image to be processed, and may be, for example, extraction by using any one or more kinds of neural networks.
Step S102: determining corrected key point information based on the key point feature map corresponding to the key point information;
the feature map of the key point corresponding to the key point information may be a feature map cut out from the feature map corresponding to the image to be processed with each key point as a center. Alternatively, the feature map corresponding to the image to be processed may be a RoI (Region of Interest ) feature map, i.e., a depth feature map extracted in the region of interest of each object, but is not limited thereto.
In this embodiment of the present application, this step may also be understood as performing a refinement operation (refinement) on the keypoint feature map corresponding to the keypoint information, or may also be understood as a refinement operation, an optimization operation, or the like. The embodiment of the application avoids excessive dependence on the accuracy of the key points, and can enable the accuracy of the object attitude estimation to reach the optimal performance compared with the case that the object attitude is estimated by only using the original key points without executing the refining process.
Step S103: and estimating the object posture in the image to be processed based on the corrected key point information.
Wherein the object pose is used to describe the relative position of the object between two coordinate systems. In the embodiment of the present application, the object pose may be the 6DoF (six Degrees of Freedom) pose of the object, but is not limited thereto. Specifically, a 6DoF pose characterizes the relative position of an object in space in terms of three translational degrees of freedom and three rotational degrees of freedom.
In the embodiment of the present application, the above-mentioned keypoints may refer to 2D (two-dimensional) keypoints obtained by projecting the 3D (three-dimensional) keypoints of the corresponding object onto the 2D image; that is, the embodiment of the present application estimates the 3D pose of the object based on the corrected 2D keypoints.
In the embodiment of the application, based on the corrected key point information, the object posture in the image to be processed is estimated, so that the problems of data noise and/or data missing can be processed.
The embodiment of the application can be applied to terminal devices such as fixed terminals and/or mobile terminals, for example: mobile phones, tablets, notebooks, wearable devices, game consoles, desktop computers, all-in-one machines, vehicle-mounted terminals, etc. A wearable device, also referred to as an intelligent wearable device, is a portable device that can be worn directly on the body or integrated into clothing or accessories, such as glasses, helmets, headbands, watches, wristbands, smart clothing, schoolbags, crutches, accessories, and the like. In practical applications, a wearable device may provide complete application functionality and be used independently without relying on other devices such as a smartphone, or it may provide only some application functions and need to be used in cooperation with other devices such as a smartphone; this is not particularly limited in the embodiments of the present application.
According to the object posture estimation method provided by the embodiment of the application, the object posture estimation refinement method based on the key points is adopted, so that the time consumption caused by the image rendering cost is avoided while the object posture estimation precision is ensured, and the processing speed of AR application can be effectively improved.
In the embodiment of the present application, it is considered that in existing supervised learning, acquiring a large amount of training data with real pose annotations consumes significant labor and time and is error-prone, which limits the best achievable performance of supervised training. In one possible implementation, for known 3D models, training data with real pose annotations may be synthesized in bulk using a computer graphics processor for model training. However, training data with real pose annotations for training the model may have a particular style, such as a synthetic-image style, and processing it together with the acquired input image (which, in contrast, has a real-image style) creates an information domain gap between the data that affects model performance. In this regard, the embodiment of the present application provides a possible implementation manner; before step S101, the method may further include the steps of:
acquiring an input image;
And converting the input image into an image with a preset image style to obtain an image to be processed.
That is, the image to be processed may be an image to be subjected to object pose estimation processing after performing specific processing (i.e., image style conversion processing) on the input image.
The image style refers to a visual sense represented by image attributes such as illumination, texture, color temperature, tone, contrast, saturation, brightness and the like of an image. For example, the canvas style, sketch style, cartoon style, etc. may all be considered different styles. For another example, the input image (a photograph of a real scene) acquired by the acquisition device may also be considered to be of a different style than the composite image (a virtual image composited by the processor), typically the input image will have finer textures, darker lighting, etc. than the composite image. It will be appreciated that different image styles may cause an image to exhibit different effects, but without changing the content of the image.
In the embodiment of the present application, the predetermined image style may specifically be an image style of training data with real pose annotations for training a model, such as a synthetic image style, but is not limited thereto.
In this embodiment of the present application, as shown in fig. 2, the process may specifically include the following steps:
Step A: extracting image content characteristics of an input image through a content backbone network;
in the embodiment of the application, the input image is used as a content image for extracting content features.
In the embodiment of the application, the image content features may be low-resolution image content features.
The low resolution image refers to an image with a resolution lower than the full resolution of the image, and the embodiment of the present application does not specifically limit the degree of low resolution, and those skilled in the art may set the degree of low resolution according to the actual situation.
In one possible implementation, the content backbone network may extract low resolution image content features directly on the incoming real image.
In another possible embodiment (for example, in fig. 2), the input image may be downsampled to obtain a low resolution input image, and then input into the content backbone network to extract the content features of the low resolution image.
It will be appreciated that the above-described different embodiments may be based on different training modes, one of which is shown in fig. 2, and other training modes may be appropriately changed based on this example, which will not be described herein.
And (B) step (B): acquiring preset image style characteristics;
In the embodiment of the present application, the image style feature may be a low resolution image style feature.
In practical applications, a person skilled in the art may preset appropriate low-resolution image style characteristics according to practical situations, and the types and contents of the low-resolution image style characteristics are not specifically limited in this embodiment of the present application. The preset low-resolution image style characteristics can be directly obtained and used in an online inference stage.
In order to improve efficiency and reliability, the embodiment of the present application provides a feasible implementation manner, where the low-resolution image style feature may directly use data of a training stage, for example, a preset low-resolution image style feature is an average value of low-resolution image style features corresponding to each training sample obtained in the training stage.
For ease of understanding, the training process shown in FIG. 2 will be described first:
the style conversion network model of the training phase includes at least three parts: content feature and style feature extraction, bilateral prediction, and rendering.
Further, the three parts can be specifically: low resolution content feature and low resolution style feature extraction, low resolution bilateral prediction, and full resolution rendering.
In the model training phase, the input training samples are two images (downsampled low-resolution images can also be used directly): one is an input image sample, and the other is a predetermined-style image sample (hereinafter a synthetic image sample is taken as an example). The content features of the (low-resolution) input image sample and the style features of the (low-resolution) synthetic image sample are extracted by two backbone networks, respectively. The (low-resolution) synthetic image samples may be stored for use in the inference stage; that is, the preset low-resolution image style feature used in the inference stage may be the average of the low-resolution image style features corresponding to the predetermined-style image samples obtained in the training stage.
Further, in the model training stage, the content features of the (low resolution) input image samples and the style features of the (low resolution) composite image samples are input into a (low resolution) bilateral network, which fuses the content features of the (low resolution) input image samples and the style features of the (low resolution) composite image samples.
Specifically, the (low-resolution) bilateral network takes the (low-resolution) content features and style features as input to predict coefficients: it learns the joint distribution of low-level features, predicts a bilateral grid, and learns how the bilateral grid encodes the conversion.
Further, in the model training stage, the feature fusion result output by the (low-resolution) bilateral network is input to a (full-resolution) renderer. The (full-resolution) rendering operates at full resolution: the feature fusion result output by the (low-resolution) bilateral network and the input image sample (the full-resolution image content) are rendered, and the (full-resolution) synthesized image is output. Specifically, for each pixel, coefficients are sampled from the bilateral grid and applied by multiplication with a predetermined matrix (for example, a 3×4 matrix). Based on the output synthesized image and the corresponding training samples, it is determined whether the content and the style respectively meet the training end conditions; if so, a trained style conversion network model is obtained, and if not, the model parameters are adjusted and training of the style conversion network model continues.
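To make the per-pixel sampling step concrete, the following is a minimal numpy sketch of slicing a 3×4 affine matrix from the bilateral grid for each full-resolution pixel and applying it to that pixel's color. The grid layout, the nearest-neighbour slicing and the luminance guidance are illustrative assumptions; a practical renderer would typically interpolate the grid trilinearly.

```python
import numpy as np

def slice_and_apply(grid, image):
    """Slice per-pixel 3x4 affine color transforms from a bilateral grid and apply them.

    grid  : (Gy, Gx, Gz, 3, 4) array of affine coefficients (assumed layout).
    image : (H, W, 3) full-resolution RGB image in [0, 1], used both as the
            guidance signal and as the content being rendered.
    """
    H, W, _ = image.shape
    Gy, Gx, Gz = grid.shape[:3]
    guide = image.mean(axis=2)                      # simple luminance guidance (an assumption)

    ys = np.clip(np.arange(H) * Gy // H, 0, Gy - 1)
    xs = np.clip(np.arange(W) * Gx // W, 0, Gx - 1)
    zs = np.clip((guide * Gz).astype(int), 0, Gz - 1)

    out = np.empty_like(image)
    for y in range(H):
        for x in range(W):
            A = grid[ys[y], xs[x], zs[y, x]]        # 3x4 affine matrix for this pixel
            rgb1 = np.append(image[y, x], 1.0)      # homogeneous color [r, g, b, 1]
            out[y, x] = A @ rgb1                    # per-pixel multiplication
    return out
```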
In the embodiment of the present application, in the model training stage, the loss functions used may include, but are not limited to, a KL (Kullback-Leibler) divergence loss function, a bilateral-space Laplacian regularizer loss function, and the like.
The content backbone network (which may also be referred to as a content feature extraction backbone network) included in the trained style conversion network model may be used in step a, the (low resolution) bilateral network and the (full resolution) renderer may be used in step C and step D, respectively, described below.
Step C: fusing the image content characteristics and the image style characteristics through a bilateral network to obtain fused characteristics;
for example, two low-resolution features are input into a trained low-resolution bilateral network, and low-resolution fusion features obtained by fusion of low-resolution image content features and low-resolution image style features are predicted.
Step D: and rendering the input image based on the fusion characteristics by a renderer to obtain an image to be processed.
The input image contains the input image content. The input image and the fusion feature are input into the trained renderer for rendering, and a generated image (synthetic image) with the same content as the input image but the predetermined image style is obtained and used as the image to be processed for subsequent refinement and object pose estimation.
Alternatively, this step uses the full-resolution input image, including the full-resolution content of the input image: the full-resolution image of the input image and the low-resolution fusion feature are input into the trained full-resolution renderer for rendering, and a generated image (synthetic image) with the same content as the input image but the predetermined image style is obtained and used as the image to be processed for subsequent refinement and object pose estimation.
In general, the style conversion process flow shown in fig. 2 mainly includes:
1. training phase:
(1) The method comprises the steps of inputting downsampled low-resolution input image samples (corresponding to low-resolution image content in a training stage in fig. 2) into an image content extraction backbone network, inputting downsampled low-resolution synthesized image samples (corresponding to low-resolution image style in the training stage in fig. 2) into an image style extraction backbone network, and respectively outputting low-resolution content characteristics and low-resolution style characteristics by the two backbone networks;
(2) Inputting the low-resolution content features and the low-resolution style features into a low-resolution bilateral network, fusing the two low-resolution features by the low-resolution bilateral network, and outputting a low-resolution feature fusion result;
(3) The low-resolution feature fusion result, together with the full-resolution image of the input image sample (corresponding to the full-resolution image content of the training phase in fig. 2), is input to the full-resolution (image) renderer, which outputs a synthetic image.
(4) Based on the output synthesized image and the corresponding training samples, it is determined whether the content and the style respectively meet the training end conditions; if so, a trained style conversion network model is obtained, and if not, the model parameters are adjusted based on the preset loss function to continue training the style conversion network model. The loss function shown below the training process in fig. 2 may be, but is not limited to, a bilateral-space Laplacian regularizer loss function.
2. Inference phase:
(1) Inputting the downsampled low-resolution input image (corresponding to the low-resolution image content of the inference stage in fig. 2) into the image content extraction backbone network, which outputs the low-resolution image content features;
(2) The low-resolution image content features and the preset low-resolution image style features are input into a low-resolution bilateral network, the low-resolution bilateral network fuses the two low-resolution features and outputs low-resolution fusion features;
(3) The low resolution fusion feature and the full resolution image of the input image (full resolution image content corresponding to the inference stage in fig. 2) are input to a full resolution renderer, which outputs a composite image (image to be processed).
In the embodiment of the present application, converting the input image into the predetermined image style (such as the synthetic-image style) reduces the inter-domain gap between the training data with object pose annotations (synthetic-image style) and the input image, improving model performance and the robustness of three-dimensional object pose estimation.
For the embodiment of the application, the style conversion process is lightweight, and has little influence on the running speed. In addition, only one input image is used when style conversion processing is carried out, so that the running time is further saved.
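As a summary of the inference flow in fig. 2, the following sketch shows how the three trained components could be chained together; the module interfaces, tensor shapes and the low-resolution size are assumptions for illustration, not the patent's actual implementation.

```python
import torch.nn.functional as F

def stylize(input_image, content_backbone, bilateral_net, renderer,
            preset_style_feature, low_res=(256, 256)):
    """Convert a full-resolution input image into the predetermined (synthetic) style.

    input_image          : (1, 3, H, W) full-resolution tensor.
    content_backbone     : trained low-resolution content feature extractor.
    bilateral_net        : trained low-resolution bilateral network.
    renderer             : trained full-resolution renderer.
    preset_style_feature : preset low-resolution style feature, e.g. the average
                           style feature collected during training.
    """
    # (1) Downsample and extract low-resolution image content features.
    low_res_image = F.interpolate(input_image, size=low_res,
                                  mode='bilinear', align_corners=False)
    content_feature = content_backbone(low_res_image)

    # (2) Fuse content and style features in the low-resolution bilateral network.
    fused = bilateral_net(content_feature, preset_style_feature)

    # (3) Render at full resolution to obtain the image to be processed.
    return renderer(fused, input_image)
```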
In this embodiment, for step S101, the key point information in the image to be processed may specifically be coordinate information of a visible key point. Wherein, the visible key points refer to effective key points which are not self-shielded or shielded by other objects.
In one possible implementation, step S101 may specifically include: determining key point coordinate information and key point visible information in an image to be processed through a detector network;
the key point information is determined based on the key point coordinate information and the key point visible information.
I.e. the determined key point information is the visible key point coordinate information.
In this embodiment, the image to be processed is input to a detector network, and as shown in fig. 3, the detector network outputs the keypoint visible information (Vk) and the keypoint coordinate information (Kj(x, y)) of the object in the image to be processed. In practice, the detector network may also output the object class (Ci). Further, an aligned RoI feature map of the image to be processed can also be obtained through intermediate processing of the detector network.
Here, the aligned RoI feature map refers to the following: because the network is designed around multi-scale feature fusion operations, RoI alignment is required to obtain output feature maps of the same size for input regions of different feature sizes.
Alternatively, the keypoint (coordinate information) regression loss function may employ a Wing loss function, the keypoint visible information classification loss function may employ an L1 loss function, and the object classification loss function may employ a cross-entropy loss function, or the like, but is not limited thereto.
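For reference, assuming the regression loss referred to above is the Wing loss commonly used for keypoint coordinate regression, a minimal sketch of it is given below; the default parameters w and eps are typical values, not values specified by the patent.

```python
import torch

def wing_loss(pred, target, w=10.0, eps=2.0):
    """Wing loss: log-shaped for small errors, L1-shaped for large errors."""
    diff = (pred - target).abs()
    # The constant C makes the two pieces meet at |diff| == w.
    C = w - w * torch.log(torch.tensor(1.0 + w / eps))
    return torch.where(diff < w,
                       w * torch.log(1.0 + diff / eps),
                       diff - C).mean()
```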
Alternatively, the keypoint visible information of the object in the image to be processed may be output by a visible keypoint proposal network in the detector network. Specifically, the visual keypoint proposal network may predict the visual keypoint proposal first, and then determine the final keypoint visual information in the predicted visual keypoint proposal.
In the embodiment of the present application, when the prediction of the visible key point proposal is performed through the visible key point proposal network, the prediction results of the feature maps of the plurality of adjacent area units may be fused. For example, when predicting whether a corner is a visible key point, feature maps of a plurality of adjacent area units of the feature map where the corner is located can be fused to obtain a corresponding prediction proposal.
In the embodiment of the present application, when the prediction results of the feature maps of the plurality of neighboring area units are fused, a manner of weighting and summing the feature maps may be adopted.
Alternatively, the weighted sum weighting coefficients may be derived from a trained neural network model.
Alternatively, another way of obtaining the weighting coefficients may be used: the adjacent area units are constructed into a graph whose nodes are the area units and whose edges represent the similarity relations between adjacent area units, and the weighting coefficients of the area units are derived from the structure and connection relations of the graph.
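As an illustration of the fusion described above, the sketch below fuses the prediction maps of adjacent area units by a weighted sum; the softmax-normalised coefficients stand in for weights obtained from a trained network or derived from the graph of area units, and the tensor shapes are assumptions.

```python
import torch

def fuse_neighbor_predictions(neighbor_maps, coefficients):
    """Weighted-sum fusion of prediction maps from adjacent area units.

    neighbor_maps : (N, C, H, W) tensor, one prediction map per adjacent area unit.
    coefficients  : (N,) raw weighting coefficients (learned, or derived from the
                    graph structure and connection relations of the area units).
    """
    weights = torch.softmax(coefficients, dim=0).view(-1, 1, 1, 1)
    return (weights * neighbor_maps).sum(dim=0)      # fused (C, H, W) prediction
```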
In the embodiment of the application, other supervision, such as two-dimensional image segmentation, can be adopted by the visible key point proposal network during training, and multiple supervision can be combined during training.
It should be noted that the method can be applied to single-stage or multi-stage object-visible key-point proposal prediction.
After the keypoint visible information and keypoint coordinate information of the object in the image to be processed are obtained, the keypoints of the object can be screened for visibility: useless keypoints that are self-occluded or occluded by other objects are deleted, and only visible, valid keypoints are retained.
Further, the feature map corresponding to the image to be processed may be cut based on the effective visible key points, so as to obtain a key point feature map corresponding to the key point information, which is used to execute step S102.
In this embodiment of the present application, the RoI feature map of the image to be processed may be obtained through the detector network, and before step S102, the method may further include: based on the keypoint information, a corresponding keypoint feature map is determined in the RoI feature map.
As an example, for each key point (x, y), a 16x16 feature map or the like centered on the key point (x, y) is cut out from the RoI feature map corresponding to the image to be processed. In practical application, a person skilled in the art may set the cutting mode according to practical situations, and the embodiment of the present application is not limited herein.
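A minimal numpy sketch of the screening and cutting steps described above: invisible keypoints are discarded and a 16×16 window centred on each remaining keypoint is cut out of the RoI feature map. The function name and the boundary handling (clipping the window inside the map) are assumptions for illustration.

```python
import numpy as np

def crop_keypoint_patches(roi_feature, keypoints, visible, patch=16):
    """Keep only visible keypoints and cut a patch centred on each of them.

    roi_feature : (C, H, W) aligned RoI feature map of the object.
    keypoints   : (K, 2) array of (x, y) keypoint coordinates in feature-map space.
    visible     : (K,) boolean visibility flags from the detector network.
    Returns the list of kept keypoints and their (C, patch, patch) feature maps.
    """
    C, H, W = roi_feature.shape
    half = patch // 2
    kept, patches = [], []
    for (x, y), vis in zip(keypoints, visible):
        if not vis:                                   # discard occluded keypoints
            continue
        x0 = int(np.clip(round(x) - half, 0, W - patch))
        y0 = int(np.clip(round(y) - half, 0, H - patch))
        kept.append((x, y))
        patches.append(roi_feature[:, y0:y0 + patch, x0:x0 + patch])
    return kept, patches
```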
In the embodiment of the application, when the number of the objects in the image to be processed is at least two, the types of the objects in the image to be processed can be determined through the detector network; step S102 may be preceded by: based on the object category and the keypoint information, a corresponding keypoint feature map is determined.
That is, when a plurality of objects are included in the image to be processed, for each key point (x, y), a feature map centered on the key point (x, y) needs to be cut out from the RoI feature map corresponding to the image to be processed according to the object category corresponding to the key point (x, y). In practical application, a person skilled in the art may set the cutting mode according to practical situations, and the embodiment of the present application is not limited herein.
In this embodiment, a possible implementation manner is provided for step S102, which may specifically include the steps of:
step S1021: performing key point offset regression on the key point feature map corresponding to the key point information to obtain key point offset residual information;
Specifically, as further shown in fig. 3, the cut keypoint feature graphs may be respectively input into a refinement network, and the refinement network may respectively perform keypoint offset regression on each keypoint feature graph to obtain a keypoint offset residual (Δx, Δy).
Step S1022: and obtaining corrected key point information based on the key point offset residual information and the key point information.
Specifically, the corrected key point information can be obtained by adding the key point offset residual information (Δx, Δy) to the key point information (x, y).
For the embodiment of the application, in model training, the loss function adopted by the keypoint refinement task may be an L1 loss function, but is not limited thereto.
In the embodiment of the application, one possible implementation way is provided for refining the network. Specifically, the refinement network may include a sub-network with at least one resolution connected, and step S1021 may specifically include:
extracting semantic features of corresponding scales from the key point feature graphs corresponding to the key point information through at least one resolution sub-network respectively;
fusing semantic features of all scales to obtain fused semantic features;
and carrying out regression processing on the fusion semantic features through the full connection layer to obtain the key point offset residual information.
As shown in fig. 4a, a structure similar to Lite-HRNet-18 (A Lightweight High-Resolution Network) can be employed as the backbone network, so that a high-resolution representation is maintained throughout the network. Starting from the high-resolution sub-network, sub-networks from high resolution to low resolution are gradually added, forming multiple stages one by one, and the multi-resolution sub-networks are connected in parallel. The sub-network of each resolution extracts semantic features of the corresponding scale from the keypoint feature map corresponding to the keypoint information.
Further, multi-scale fusion is repeatedly performed such that each high-resolution to low-resolution representation receives information from other parallel representations, thereby obtaining a rich high-resolution representation. So that the predicted keypoint offset is also more accurate in space.
Further, regression processing is performed on the fused semantic features through the fully connected layer to obtain the predicted keypoint offset residual information (also called keypoint residual information). As described above, the corrected keypoint information can be obtained by adding the keypoint offset residual information (Δx, Δy) to the keypoint information (x, y).
In general, the refinement network processing flow shown in fig. 4a mainly includes:
(1) Constructing sub-networks with various resolutions (3 are taken as examples in fig. 4 a) through a convolution unit, an up-sampling operation and a down-sampling operation, and respectively extracting semantic features with corresponding scales from the key point feature images through the sub-networks with various resolutions;
(2) Fusing semantic features of all scales to obtain fused semantic features;
(3) Regression processing is carried out on the fusion semantic features through the full connection layer, so that key point offset residual information is obtained;
(4) And adding the key point offset residual information to the corresponding key point information to obtain corrected (2D) key point information.
The obtained visualization of the corrected (2D) keypoint information is shown in fig. 4 b.
For the embodiment of the application, the refinement network shown in fig. 4a can adjust the proportion and the number of stages according to the speed requirement during reasoning, so that the refinement network has higher flexibility.
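The following is a highly simplified sketch of such a refinement head, standing in for the Lite-HRNet-18-like backbone of fig. 4a: two parallel resolution branches, multi-scale fusion, and a fully connected layer regressing the keypoint offset residual (Δx, Δy). The channel widths and the number of branches/stages are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeypointRefineNet(nn.Module):
    """Simplified two-branch refinement head: parallel high/low-resolution branches,
    multi-scale fusion, and a fully connected layer regressing (dx, dy)."""

    def __init__(self, in_channels=256, width=32, patch=16):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, width, 3, padding=1)
        self.high = nn.Conv2d(width, width, 3, padding=1)               # high-resolution branch
        self.low = nn.Conv2d(width, 2 * width, 3, stride=2, padding=1)  # low-resolution branch
        self.fuse = nn.Conv2d(3 * width, width, 1)                      # multi-scale fusion
        self.fc = nn.Linear(width * patch * patch, 2)                   # -> (dx, dy)

    def forward(self, patch_features):
        x = F.relu(self.stem(patch_features))
        high = F.relu(self.high(x))
        low = F.relu(self.low(x))
        low_up = F.interpolate(low, size=high.shape[-2:], mode='bilinear',
                               align_corners=False)
        fused = F.relu(self.fuse(torch.cat([high, low_up], dim=1)))
        return self.fc(fused.flatten(1))                                # keypoint offset residual


# Example usage (shapes assumed): patch is a (1, 256, 16, 16) keypoint feature map.
# net = KeypointRefineNet()
# offset = net(patch)                     # (1, 2) residual (dx, dy)
# corrected_xy = detected_xy + offset[0]  # corrected keypoint coordinates
```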
Further, in step S103, a PnP (Perspective-n-Point) algorithm is used to estimate the object pose.
Specifically, the keypoint information is 2D keypoint information, and step S103 may specifically include step S1031: based on the corrected 2D keypoint information and a preset 3D model set, estimating the object pose in the image to be processed through the PnP algorithm; specifically, estimating the 6DoF pose of the object in the image to be processed, as shown in fig. 3.
The PnP algorithm solves for the motion of 3D-to-2D point pairs: it recovers the camera pose from the real-space coordinates of points and their image coordinates. For the embodiment of the present application, the PnP algorithm can estimate the camera pose when a number of 3D keypoints and their corresponding projection positions (2D keypoints) are known.
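As one concrete way of carrying out this step, the sketch below uses OpenCV's solvePnP to recover the 6DoF pose from the corrected 2D keypoints and their corresponding 3D model keypoints; the EPnP flag and the zero distortion coefficients are assumptions for illustration.

```python
import cv2
import numpy as np

def estimate_pose(keypoints_3d, keypoints_2d, camera_matrix):
    """Estimate the 6DoF object pose from corrected 2D keypoints and their
    corresponding 3D model keypoints with a PnP solver.

    keypoints_3d  : (N, 3) model keypoints, N >= 4.
    keypoints_2d  : (N, 2) corrected 2D keypoints in image coordinates.
    camera_matrix : (3, 3) camera intrinsic matrix K.
    Returns the 3x3 rotation matrix R and 3x1 translation vector T.
    """
    dist_coeffs = np.zeros(5)                      # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(
        keypoints_3d.astype(np.float64),
        keypoints_2d.astype(np.float64),
        camera_matrix.astype(np.float64),
        dist_coeffs,
        flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed")
    R, _ = cv2.Rodrigues(rvec)                     # rotation vector -> rotation matrix
    return R, tvec
```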
In this embodiment of the present application, the preset 3D model set includes 3D models of each object in the real scene corresponding to the image to be processed. In practical applications, each 3D model may be derived from a known CAD (Computer-Aided Design) model (for example, in fig. 3); for example, the 3D model of an object may be retrieved from a three-dimensional CAD model library, or each 3D model may be obtained by scanning the 3D object. The source of the preset 3D model set is not specifically limited in this embodiment of the present application.
Further, in combination with the description above, when the image to be processed is input to the detector network to determine the 2D keypoint information in the image to be processed, the detector network may also output the object class (Ci). That is, step S101 may include: determining, through the detector network, the object class in the image to be processed. The determined object class may be used to determine the 3D keypoint information.
Specifically, step S1031 may include:
determining 3D key point information according to the object category and the 3D model set;
based on the corrected 2D key point information and the 3D key point information, estimating the object posture in the image to be processed through a PnP algorithm.
In the embodiment of the present application, searching the 3D model set in combination with the object category allows the corresponding 3D keypoint information to be determined more efficiently, and the three-dimensional object pose in the image to be processed is then estimated through the PnP algorithm based on the corrected 2D keypoint information and the 3D keypoint information.
In general, the object pose estimation process flow shown in fig. 3 mainly includes:
(1) The image to be processed is input to a detector network, which outputs the keypoint visible information (whether visible) and the keypoint coordinate information of the object in the image to be processed. In practical applications, the detector network may also output the object class;
(2) Screening the keypoints of the object for visibility based on the keypoint visible information and keypoint coordinate information of the object in the image to be processed, deleting useless keypoints that are self-occluded or occluded by other objects, and keeping only the visible, valid keypoints;
(3) Cutting the feature map corresponding to the image to be processed based on the effective visible key points to obtain a key point feature map corresponding to the key point information;
(4) Inputting the cut key point feature images into a refinement network respectively, and performing key point offset regression on each key point feature image by the refinement network respectively to obtain corresponding key point offset residual information;
(5) Adding the key point offset residual information to the corresponding key point information to obtain corrected (2D) key point information;
(6) Based on the corrected 2D key point information and the corresponding 3D key point information (such as the 3D key point information in a CAD model), estimating the 6DoF gesture of the object in the image to be processed through a PnP algorithm.
The keypoint refinement process may include, but is not limited to, a process (4) and a process (5).
The embodiment of the present application also provides a feasible implementation in which an iterative keypoint alignment method is used to estimate the object pose: steps S101 to S103 are performed iteratively to alternately update the keypoints and the 6DoF pose of the object, and the iterative keypoint alignment feedback loop is shown in fig. 5a.
Specifically, the key point information is 2D key point information, and after the object pose in the image to be processed is estimated for the first time, the method specifically may further include the steps of:
the following operations are repeatedly performed until a stop condition is satisfied:
Determining 3D key point mapping information of an object in the image to be processed based on the estimated object posture in the image to be processed;
obtaining updated key point information based on the 3D key point mapping information;
and re-determining the corrected key point information based on the updated key point information, and estimating the object posture in the image to be processed based on the re-determined corrected key point information.
Specifically, determining 3D keypoint mapping information of an object in the image to be processed based on the estimated object pose in the image to be processed may specifically include: and determining 3D key point mapping information of the object in the image to be processed based on the estimated object posture in the image to be processed, a preset 3D model set and a camera internal matrix.
Similarly, the preset 3D model set includes 3D models of objects in the real scene corresponding to the image to be processed. In practical applications, each 3D model may be derived from a known CAD model (for example, fig. 5 a), for example, a 3D model of an object may be obtained by searching from a three-dimensional CAD model library of the object, or each 3D model may be obtained by scanning a 3D object, and in this embodiment of the present application, a preset 3D model set is not specifically limited.
In this embodiment of the present application, the 3D keypoint mapping information of the object in the image to be processed is determined based on the estimated object pose in the image to be processed, the preset 3D model set, and the camera internal matrix, and specifically may be the 3D keypoint mapping information of the object in the image to be processed is determined based on the estimated object pose in the image to be processed, the 3D keypoint information (corresponding to the object in the image to be processed) in the preset 3D model set, and the camera internal matrix.
Wherein, the intrinsic matrix of the camera is the attribute of the camera, and the intrinsic matrix of different cameras can be different. In the embodiment of the application, for the acquired input image, the intrinsic matrix of the camera acquiring the input image may be acquired correspondingly.
In this embodiment, taking the (k+1)-th iteration as an example: after the object pose in the image to be processed has been estimated in the k-th iteration, the 3D keypoint mapping information of the object in the image to be processed is determined by combining the corresponding 3D keypoint information in the preset 3D model set and the camera intrinsic matrix, and the keypoint information used in the k-th iteration is updated to obtain the keypoint information used in the (k+1)-th iteration, so as to perform keypoint refinement and object pose estimation in the (k+1)-th iteration.
By way of example, assume that the current iteration is at the k-th step and the estimated object pose is [R^(k) T^(k)], where R is the rotation matrix and T is the translation vector. The 3D keypoint information in the corresponding 3D model set is P = [P1, P2, ..., Pn]. If the intrinsic matrix of the camera is K, the following formula can be used to calculate the 3D keypoint mapping information:
p^(k+1) = K [R^(k) T^(k)] P
The 2D keypoints are updated based on the calculated 3D keypoint mapping information to obtain the keypoint information p^(k+1) of the (k+1)-th iteration.
Further, based on the updated keypoint information p^(k+1), the feature map centred on each keypoint is input to the refinement network to obtain the corrected keypoints p̂^(k+1).
Still further, based on the corrected keypoint information p̂^(k+1), the PnP algorithm is used to update the object pose [R^(k+1) T^(k+1)] estimated in the (k+1)-th iteration for the image to be processed.
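A small numpy sketch of this projection step, assuming R, T and K are given as arrays and the function name is illustrative:

```python
import numpy as np

def project_keypoints(K, R, T, points_3d):
    """Project 3D model keypoints with the current pose estimate: p = K [R | T] P.

    K         : (3, 3) camera intrinsic matrix.
    R, T      : (3, 3) rotation matrix and (3,) translation vector of the k-th estimate.
    points_3d : (N, 3) 3D keypoints P1..Pn of the object model.
    Returns the (N, 2) updated 2D keypoints used in iteration k+1.
    """
    cam = (R @ points_3d.T).T + T        # transform the points into the camera frame
    uvw = (K @ cam.T).T                  # homogeneous image coordinates
    return uvw[:, :2] / uvw[:, 2:3]      # perspective division -> (x, y)
```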
The process is repeated to carry out iterative updating on the key points and the 6DoF gestures of the object, so that the accuracy of estimating the key points and the 6DoF gestures of the object can be improved.
In the embodiment of the application, the iterative updating of the key point information is stopped when the stopping condition is met.
Wherein the stop condition includes at least one of:
the difference value before and after the key point correction is smaller than a threshold value;
the key point correction reaches a predetermined number of times.
The number of keypoint corrections can be understood as the number of object pose refinement iterations.
In the embodiment of the present application, the iteration may be set to stop only when the difference before and after keypoint correction is smaller than the threshold, or only when keypoint correction reaches the predetermined number of times (i.e., the number of iterations); the two conditions may also be set simultaneously, and the iteration stops as soon as either of them is satisfied.
Specifically, the difference value before and after the keypoint correction being smaller than the threshold value includes at least one of:
the sum of the differences before and after the correction of the at least one key point is smaller than a threshold value;
the difference before and after correction for each of the at least one keypoint is less than a threshold.
As an example, the keypoint update difference sum D may be
D = Σj ‖ p̂j^(k+1) − p̂j^(k) ‖
i.e., if the sum of the differences between the corrected keypoints p̂^(k+1) obtained in the (k+1)-th iteration and the corrected keypoints p̂^(k) obtained in the k-th iteration is smaller than the threshold, the (k+2)-th iteration will not be performed;
and/or, when the difference between each corrected keypoint p̂j^(k+1) obtained in the (k+1)-th iteration and the corresponding corrected keypoint p̂j^(k) of the k-th iteration is smaller than the threshold, the (k+2)-th iteration will not be performed;
and/or, when the number of iterations reaches the predetermined number k+1, the (k+2)-th iteration will not be performed.
For this example, the object pose [R^(k+1) T^(k+1)] estimated in the (k+1)-th iteration for the image to be processed can be used as the output of the network, i.e., the final object pose estimation result.
In general, the iterative refinement process flow shown in fig. 5a mainly includes:
(1) Cropping a key point feature map centered on each key point based on the initialized 2D key point information, the object category and the RoI feature map;
(2) Inputting the cropped key point feature maps into the refinement network respectively, where the refinement network performs key point offset regression on each key point feature map to obtain the corresponding key point offset residual information;
(3) Adding the key point offset residual information to the corresponding 2D key point information to obtain corrected 2D key point information;
(4) Estimating the 6DoF pose of the object in the image to be processed through the PnP algorithm based on the corrected 2D key point information and the corresponding 3D key point information;
(5) Determining the 3D key point mapping information of the object in the image to be processed based on the estimated object pose in the image to be processed, the corresponding 3D key point information and the camera intrinsic matrix;
(6) Updating the key point information in the step (1) based on the 3D key point mapping information to obtain updated key point information, and re-executing from the step (1) based on the updated key point information;
repeating the steps (1) - (6) until the difference value before and after the key point correction is smaller than the threshold value and/or the iteration reaches a preset number of times.
The obtained visualization of the corrected (2D) keypoint information is shown in fig. 5 b.
In the embodiment of the application, the iterative refinement method for object 6DoF pose estimation may use a CNN (Convolutional Neural Network), but is not limited thereto. Specifically, the 3D key points of the object are projected onto the 2D image, and feature extraction is performed through an encoder to obtain the corresponding 2D key point information. The key point feature maps corresponding to the 2D key point information of the objects are passed through the convolutional neural network to obtain corrected 2D key point information, which is used to predict the 3D pose of the objects.
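The embodiment does not prescribe a concrete architecture for the refinement network; the following is only a rough sketch, under assumed layer sizes, of a CNN head that regresses a 2D offset residual for each key-point-centered feature crop and adds it to the 2D key point, in the spirit of the description above.

```python
import torch
import torch.nn as nn

class KeypointRefinementHead(nn.Module):
    """Regresses a 2D offset residual for each key-point-centered feature crop."""

    def __init__(self, in_channels=256, crop_size=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, 128, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.fc = nn.Linear(64 * crop_size * crop_size, 2)  # (dx, dy) offset residual

    def forward(self, crops, keypoints_2d):
        # crops: (N, C, S, S) key-point-centered feature crops; keypoints_2d: (N, 2)
        x = self.conv(crops)
        offsets = self.fc(x.flatten(1))
        return keypoints_2d + offsets  # corrected 2D key points
```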
In the embodiment of the application, the object key points and the 6DoF pose are iteratively optimized through the iterative refinement method, so as to improve the accuracy of object pose estimation.
Based on the above embodiments, the embodiments of the present application provide a complete process of object pose estimation, as shown in fig. 6a, mainly including:
(1) Inputting the input image into a style conversion network, and outputting a to-be-processed image in the style of the full-resolution synthetic image by the style conversion network;
(2) The image to be processed is input to a detector network, which outputs the key point visible information (whether each key point is visible) and the key point coordinate information of the object in the image to be processed. In practical applications, the detector network can also output the object category and an intermediate RoI feature map;
(3) Determining and screening whether the key points of the object are visible based on the key point visible information and the key point coordinate information of the object in the image to be processed, and retaining the visible key point information;
(4) Cropping a key point feature map centered on each key point based on the visible key point information and the RoI feature map; if the image to be processed contains a plurality of objects, the key point feature maps are cropped based on the object categories;
(5) Inputting the cropped key point feature maps into the refinement network respectively, where the refinement network performs key point offset regression on each key point feature map to obtain the corresponding key point offset residual information;
(6) Adding the key point offset residual information to the corresponding 2D key point information to obtain corrected 2D key point information;
(7) Estimating the 6DoF pose of the object in the image to be processed through the PnP algorithm based on the corrected 2D key point information and the corresponding 3D key point information;
(8) Determining the 3D key point mapping information of the object in the image to be processed based on the estimated object pose in the image to be processed, the corresponding 3D key point information and the camera intrinsic matrix;
(9) Obtaining updated visible key point information based on the 3D key point mapping information, and re-executing from step (4) based on the updated visible key point information.
Repeating the steps (4) - (9) until the difference value before and after the key point correction is smaller than the threshold value and/or the iteration reaches a preset number of times.
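Putting steps (1)-(9) together, a non-authoritative end-to-end sketch of the loop might look as follows; style_net, detector, crop_features, refine_head and model_keypoints_3d are placeholders for components the embodiment only describes functionally, and the thresholds are arbitrary.

```python
import cv2
import numpy as np

def estimate_pose(input_image, style_net, detector, crop_features, refine_head,
                  model_keypoints_3d, K, max_iters=4, diff_thresh=1.0):
    """Sketch of steps (1)-(9): style conversion, detection, visibility
    filtering, iterative key point refinement and PnP pose estimation."""
    image = style_net(input_image)                          # (1) convert to synthetic-image style
    kps_2d, visible, category, roi_feat = detector(image)   # (2) key points, visibility, class, RoI
    kps_2d = kps_2d[visible]                                 # (3) keep only visible key points
    P_3d = model_keypoints_3d[category][visible]             # 3D key points of this object class

    for _ in range(max_iters):
        crops = crop_features(roi_feat, kps_2d, category)            # (4) key-point-centered crops
        kps_corrected = refine_head(crops, kps_2d)                   # (5)+(6) offset regression
        _, rvec, tvec = cv2.solvePnP(P_3d, kps_corrected, K, None)   # (7) PnP pose
        proj, _ = cv2.projectPoints(P_3d, rvec, tvec, K, None)       # (8) re-project 3D key points
        kps_next = proj.reshape(-1, 2)                               # (9) updated key points
        # stop when the key points barely moved during correction
        if np.linalg.norm(kps_corrected - kps_2d, axis=1).sum() < diff_thresh:
            break
        kps_2d = kps_next

    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```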
A visualization of the estimated object pose in the image to be processed is shown in fig. 6b.
The object posture estimation method provided by the embodiment of the application adopts an end-to-end trainable key point regression and refinement method, and solves the problem of difficulty in training in the prior art.
The object posture estimation method provided by the embodiment of the application avoids time-consuming image rendering and comparison circulation in most of the existing posture refinement methods, and is very effective in speed and precision.
The 3D object 6DoF attitude estimation and optimization method based on the color image input can improve the efficiency and the robustness of a system in augmented reality application.
According to the style conversion deep learning network provided by the embodiment of the application, the problem of data domain interval is solved by converting the input image into the image in the style of the synthesized image for processing, and the robustness of 3D object posture estimation is improved.
The object visible 2D key point estimation and the object 6DoF attitude estimation method based on the key points can solve the problem of data noise and data missing.
According to the iterative optimization and 6DoF attitude estimation optimization method for the object key points, which are provided by the embodiment of the application, the result precision is improved by using geometric priori knowledge.
Based on the embodiments, the object posture estimation method provided by the embodiment of the application can realize accurate and rapid augmented reality interaction.
In practical applications, AR applications are enabled through 3D virtual-real alignment, for which identifying the 6DoF pose of the object is crucial. By adopting the object posture estimation method provided by the embodiment of the application, correctly estimating the 3D pose allows the virtual content to be aligned with the real object quickly and effectively in real time. In particular for moving objects in a real scene, real-time pose estimation can ensure that the virtual content is updated in time when a visual device in an AR system is used, so that there is little or no delay and human interactions with objects, especially objects that move in the real scene, are better handled.
In practice, the training data with real pose annotations used to train a model may have a particular style, such as a synthetic image style, while the input image acquired from a real scene (as opposed to a synthetic-style image) has a different style; processing such an input image directly may therefore create a domain gap between the data that affects model performance.
Based on this, the embodiment of the application further provides an object posture estimation method, as shown in fig. 7, where the method includes:
step S201: converting the input image into an image to be processed of a predetermined image style;
step S202: determining key point information in an image to be processed;
step S203: based on the keypoint information, an object pose in the input image is estimated.
The image style refers to a visual sense represented by image attributes such as illumination, texture, color temperature, tone, contrast, saturation, brightness and the like of an image. For example, the canvas style, sketch style, cartoon style, etc. may all be considered different styles. For another example, the input image (a photograph of a real scene) acquired by the acquisition device may also be considered to be of a different style than the composite image (a virtual image composited by the processor), typically the input image will have finer textures, darker lighting, etc. than the composite image. It will be appreciated that different image styles may cause an image to exhibit different effects, but without changing the content of the image.
In the embodiment of the present application, the predetermined image style may specifically be an image style of training data with real pose annotations for training a model, such as a synthetic image style, but is not limited thereto.
In this embodiment of the present application, as shown in fig. 2, the process may specifically include the following steps:
step A: extracting image content characteristics of an input image through a content backbone network;
in the embodiment of the application, the input image is used as a content image for extracting content features.
In the embodiment of the application, the image content features may be low-resolution image content features.
The low resolution image refers to an image with a resolution lower than the full resolution of the image, and the embodiment of the present application does not specifically limit the degree of low resolution, and those skilled in the art may set the degree of low resolution according to the actual situation.
In one possible implementation, the content backbone network may extract low resolution image content features directly on the incoming real image.
In another possible embodiment (for example, in fig. 2), the input image may be downsampled to obtain a low resolution input image, and then input into the content backbone network to extract the content features of the low resolution image.
It will be appreciated that the above-described different embodiments may be based on different training modes, one of which is shown in fig. 2, and other training modes may be appropriately changed based on this example, which will not be described herein.
Step B: acquiring preset image style characteristics;
in the embodiment of the present application, the image style feature may be a low resolution image style feature.
In practical applications, a person skilled in the art may preset appropriate low-resolution image style characteristics according to practical situations, and the types and contents of the low-resolution image style characteristics are not specifically limited in this embodiment of the present application. The preset low-resolution image style characteristics can be directly obtained and used in an online inference stage.
In order to improve efficiency and reliability, the embodiment of the present application provides a feasible implementation manner, where the low-resolution image style feature may directly use data of a training stage, for example, a preset low-resolution image style feature is an average value of low-resolution image style features corresponding to each training sample obtained in the training stage.
For ease of understanding, reference is made to the description of the training process shown in fig. 2 above, which is not repeated here.
Step C: fusing the image content characteristics and the image style characteristics through a bilateral network to obtain fused characteristics;
for example, the two low-resolution features are input into a trained low-resolution bilateral network, which predicts a low-resolution fusion feature obtained by fusing the low-resolution image content feature and the low-resolution image style feature.
Step D: and rendering the input image based on the fusion characteristics by a renderer to obtain an image to be processed.
The input image comprises the input image content; the input image and the fusion feature are input into a trained renderer for rendering, and a generated image (synthetic image) with the same content as the input image and the predetermined image style is then obtained and used as the image to be processed for the subsequent refinement operation and object posture estimation.
Alternatively, this step uses the full-resolution input image, including its full-resolution content: the full-resolution input image and the low-resolution fusion feature are input into a trained full-resolution renderer for rendering, and a generated image (synthetic image) with the same content as the input image and the predetermined image style is then obtained and used as the image to be processed for the subsequent refinement operation and object pose estimation.
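As an illustration only of steps A to D, the forward pass of the style conversion could be organized as below; the module names (content_backbone, bilateral_net, renderer), the bilinear downsampling and the 256-pixel low resolution are assumptions made here for the sketch, not details fixed by the embodiment.

```python
import torch.nn.functional as F

def style_convert(input_image, content_backbone, bilateral_net, renderer,
                  preset_style_feature, low_res=256):
    """Steps A-D: extract low-resolution content features, fuse them with the
    preset style feature via the bilateral network, then render at full resolution."""
    # Step A: downsample and extract low-resolution image content features
    low_res_image = F.interpolate(input_image, size=(low_res, low_res),
                                  mode='bilinear', align_corners=False)
    content_feat = content_backbone(low_res_image)

    # Step B: preset style feature, e.g. the mean of the training-set style features
    style_feat = preset_style_feature

    # Step C: fuse content and style features with the bilateral network
    fused = bilateral_net(content_feat, style_feat)

    # Step D: render the full-resolution input image with the fused features
    return renderer(input_image, fused)
```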
In the embodiment of the application, the input image is converted into the predetermined image style (such as the synthetic image style), so that inter-domain gaps between training data (synthetic image style) with the object gesture annotation and the input image are reduced, the model performance is improved, and the robustness of three-dimensional object gesture estimation is improved.
For the embodiment of the application, the style conversion process is lightweight, and has little influence on the running speed. In addition, only one input image is used when style conversion processing is carried out, so that the running time is further saved.
For the embodiment of the present application, the other processing procedures and the resulting beneficial effects may be specifically referred to the foregoing descriptions, and will not be repeated here.
An embodiment of the present application provides an object posture estimation device, as shown in fig. 8, the object posture estimation device 80 may include: a determination module 801, a correction module 802, and an estimation module 803, wherein,
the determining module 801 is configured to determine key point information in an image to be processed;
the correction module 802 is configured to determine corrected key point information based on the key point feature map corresponding to the key point information;
the estimation module 803 is configured to estimate the object pose in the image to be processed based on the corrected key point information.
In an alternative embodiment, the determining module 801, before being used for determining keypoint information in the image to be processed, is further configured to:
acquiring an input image;
and converting the input image into an image with a preset image style to obtain an image to be processed.
In an alternative embodiment, the determining module 801 is specifically configured to, when configured to convert an input image into an image of a predetermined image style, obtain an image to be processed:
extracting image content characteristics of an input image through a content backbone network;
acquiring preset image style characteristics;
fusing the image content characteristics and the image style characteristics through a bilateral network to obtain fused characteristics;
and rendering the input image based on the fusion characteristics by a renderer to obtain an image to be processed.
In an alternative embodiment, the image content features are low resolution image content features; the image style feature is a low resolution image style feature.
In an alternative embodiment, the modification module 802 is specifically configured to, when configured to determine the modified keypoint information based on the keypoint feature map corresponding to the keypoint information:
performing key point offset regression on the key point feature map corresponding to the key point information to obtain key point offset residual information;
and obtaining corrected key point information based on the key point offset residual information and the key point information.
In an alternative embodiment, the keypoint information is 2D keypoint information, and the determining module 801 is further configured to:
The following operations are repeatedly performed until a stop condition is satisfied:
determining 3D key point mapping information of an object in the image to be processed based on the estimated object posture in the image to be processed;
obtaining updated key point information based on the 3D key point mapping information;
re-determining the corrected key point information based on the updated key point information, and estimating the object posture in the image to be processed based on the re-determined corrected key point information;
wherein the stop condition includes at least one of:
the difference value before and after the key point correction is smaller than a threshold value;
the key point correction reaches a predetermined number of times.
In an alternative embodiment, the determining module 801, when configured to determine 3D keypoint mapping information of an object in the image to be processed based on the estimated pose of the object in the image to be processed, is specifically configured to:
and determining 3D key point mapping information of the object in the image to be processed based on the estimated object posture in the image to be processed, a preset 3D model set and a camera internal matrix.
In an alternative embodiment, the difference before and after the keypoint correction being less than the threshold comprises at least one of:
the sum of the differences before and after the correction of the at least one key point is smaller than a threshold value;
The difference before and after correction for each of the at least one keypoint is less than a threshold.
In an alternative embodiment, the determining module 801, when used for determining keypoint information in an image to be processed, is specifically configured to:
determining key point coordinate information and key point visible information in an image to be processed through a detector network;
the key point information is determined based on the key point coordinate information and the key point visible information.
In an alternative embodiment, the determining module 801 is further configured to:
obtaining a RoI feature map of an interest area of an image to be processed through a detector network;
before the correction module 802 determines corrected keypoint information based on the keypoint feature map corresponding to the keypoint information, the determination module 801 is further configured to:
based on the keypoint information, a corresponding keypoint feature map is determined in the RoI feature map.
In an alternative embodiment, the determining module 801 is further configured to:
when at least two objects are in the image to be processed, determining the object types in the image to be processed through a detector network;
before the correction module 802 determines corrected keypoint information based on the keypoint feature map corresponding to the keypoint information, the determination module 801 is further configured to:
Based on the object category and the keypoint information, a corresponding keypoint feature map is determined.
In an alternative embodiment, the keypoint information is 2D keypoint information, and the estimating module 803 is specifically configured to, when configured to estimate the pose of the object in the image to be processed based on the corrected keypoint information:
based on the corrected 2D key point information and a preset 3D model set, estimating the object posture in the image to be processed through a perspective n-point PnP algorithm.
In an alternative embodiment, the determining module 801, when used for determining 2D keypoint information in an image to be processed, is specifically configured to:
determining, by a detector network, a class of objects in the image to be processed;
the estimating module 803 is configured to, when estimating, by a PnP algorithm, an object pose in an image to be processed based on the corrected 2D keypoint information and a preset 3D model set, specifically:
determining 3D key point information according to the object category and the 3D model set;
based on the corrected 2D key point information and the 3D key point information, estimating the object posture in the image to be processed through a PnP algorithm.
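For example, if the 3D model set is organized as a mapping from object category to the 3D key points of that category's CAD model (a layout assumed here for illustration), the category-dependent PnP step could be sketched as:

```python
import cv2

def estimate_pose_for_category(category, corrected_kps_2d, model_set_3d, K):
    """Select the 3D key points of the detected object category from the 3D model
    set (e.g. a dict mapping category -> (n, 3) array of CAD-model key points),
    then estimate the 6DoF pose with PnP."""
    P_3d = model_set_3d[category]
    _, rvec, tvec = cv2.solvePnP(P_3d, corrected_kps_2d, K, None)
    R, _ = cv2.Rodrigues(rvec)
    return R, tvec
```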
The apparatus of the embodiments of the present application may perform the method provided by the embodiments of the present application, and implementation principles thereof are similar, and actions performed by each module in the apparatus of each embodiment of the present application correspond to steps in the method of each embodiment of the present application, and detailed functional descriptions and beneficial effects generated by each module of the apparatus may be specifically referred to descriptions in the corresponding methods shown in the foregoing, which are not repeated herein.
The embodiment of the present application provides an object posture estimation device, as shown in fig. 9, the object posture estimation device 90 may include: a style conversion module 901, a keypoint determination module 902, and an object pose estimation module 903, wherein,
the style conversion module 901 is used for converting an input image into an image to be processed with a predetermined image style;
the key point determining module 902 is configured to determine key point information in an image to be processed;
the object pose estimation module 903 is configured to estimate an object pose in an input image based on keypoint information.
In an alternative embodiment, the style conversion module 901, when used for converting an input image into a to-be-processed image of a predetermined image style, is specifically configured to:
extracting image content characteristics of an input image through a content backbone network;
acquiring preset image style characteristics;
fusing the image content characteristics and the image style characteristics through a bilateral network to obtain fused characteristics;
and rendering the input image based on the fusion characteristics by a renderer to obtain an image to be processed.
In an alternative embodiment, the image content features are low resolution image content features; the image style feature is a low resolution image style feature.
The apparatus of the embodiments of the present application may perform the method provided by the embodiments of the present application, and implementation principles thereof are similar, and actions performed by each module in the apparatus of each embodiment of the present application correspond to steps in the method of each embodiment of the present application, and detailed functional descriptions and beneficial effects generated by each module of the apparatus may be specifically referred to descriptions in the corresponding methods shown in the foregoing, which are not repeated herein.
The apparatus provided in the embodiments of the present application may implement at least one module of the plurality of modules through an AI (Artificial Intelligence ) model. The functions associated with the AI may be performed by a non-volatile memory, a volatile memory, and a processor.
The processor may include one or more processors. In this case, the one or more processors may be general-purpose processors such as a Central Processing Unit (CPU) or an Application Processor (AP), graphics-only processors such as a Graphics Processing Unit (GPU) or a Visual Processing Unit (VPU), and/or AI-dedicated processors such as a Neural Processing Unit (NPU).
The one or more processors control the processing of the input data according to predefined operating rules or Artificial Intelligence (AI) models stored in the non-volatile memory and the volatile memory. Predefined operational rules or artificial intelligence models are provided through training or learning.
Here, providing by learning refers to deriving a predefined operation rule or an AI model having a desired characteristic by applying a learning algorithm to a plurality of learning data. The learning may be performed in the apparatus itself in which the AI according to the embodiment is performed, and/or may be implemented by a separate server/system.
The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and the calculation of one layer is performed based on the calculation result of the previous layer and the plurality of weights of the current layer. Examples of neural networks include, but are not limited to, Convolutional Neural Networks (CNNs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Restricted Boltzmann Machines (RBMs), Deep Belief Networks (DBNs), Bidirectional Recurrent Deep Neural Networks (BRDNNs), Generative Adversarial Networks (GANs), and Deep Q-Networks.
A learning algorithm is a method of training a predetermined target device (e.g., a robot) using a plurality of learning data so as to make, allow, or control the target device to make a determination or prediction. Examples of such learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
An electronic device is provided in an embodiment of the present application, including a memory, a processor, and a computer program stored on the memory, where the processor executes the computer program to implement the steps of the foregoing method embodiments.
In an alternative embodiment, an electronic device is provided, as shown in fig. 10, the electronic device 1000 shown in fig. 10 includes: a processor 1001 and a memory 1003. The processor 1001 is coupled to the memory 1003, such as via a bus 1002. Optionally, the electronic device 1000 may further include a transceiver 1004, where the transceiver 1004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data, etc. It should be noted that, in practical applications, the transceiver 1004 is not limited to one, and the structure of the electronic device 1000 is not limited to the embodiments of the present application.
The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various exemplary logic blocks, modules, and circuits described in connection with this disclosure. The processor 1001 may also be a combination that implements computing functionality, such as a combination comprising one or more microprocessors, or a combination of a DSP and a microprocessor.
The memory 1003 may be, but is not limited to, a ROM (Read Only Memory) or other type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or other type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, other magnetic storage devices, or any other medium that can be used to carry or store a computer program and that can be read by a computer.
The memory 1003 is used to store a computer program for executing the embodiments of the present application, and is controlled to be executed by the processor 1001. The processor 1001 is arranged to execute a computer program stored in the memory 1003 to implement the steps shown in the foregoing method embodiments.
According to the embodiment of the application, in object pose estimation performed in an electronic device, output data of a feature in an image is obtained by using image data as input data of an artificial intelligence model. The artificial intelligence model may be obtained through training. Here, "obtained by training" means that a basic artificial intelligence model is trained with a plurality of training data by a training algorithm to obtain a predefined operating rule or artificial intelligence model configured to perform a desired feature (or purpose). The artificial intelligence model may include a plurality of neural network layers. Each of the plurality of neural network layers includes a plurality of weight values, and the neural network calculation is performed by calculation between the calculation result of the previous layer and the plurality of weight values.
Visual understanding is a technique for identifying and processing things like human vision and includes, for example, object recognition, object tracking, image retrieval, human recognition, scene recognition, 3D reconstruction/localization, or image enhancement.
Embodiments of the present application provide a computer readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, may implement the steps and corresponding content of the foregoing method embodiments.
The embodiments of the present application also provide a computer program product, which includes a computer program, where the computer program can implement the steps of the foregoing method embodiments and corresponding content when executed by a processor.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described.
It should be understood that, although the flowcharts of the embodiments of the present application indicate the respective operation steps by arrows, the order of implementation of these steps is not limited to the order indicated by the arrows. In some implementations of embodiments of the present application, the implementation steps in the flowcharts may be performed in other orders as desired, unless explicitly stated herein. Furthermore, some or all of the steps in the flowcharts may include multiple sub-steps or multiple stages based on the actual implementation scenario. Some or all of these sub-steps or phases may be performed at the same time, or each of these sub-steps or phases may be performed at different times, respectively. In the case of different execution time, the execution sequence of the sub-steps or stages may be flexibly configured according to the requirement, which is not limited in the embodiment of the present application.
The foregoing is merely an optional implementation manner of some implementation scenarios of the present application, and it should be noted that, for those skilled in the art, other similar implementation manners based on the technical ideas of the present application are adopted without departing from the technical ideas of the solution of the present application, which also belongs to the protection scope of the embodiments of the present application.
Claims (16)
1. An object pose estimation method, comprising:
determining key point information in an image to be processed;
determining corrected key point information based on the key point feature map corresponding to the key point information;
and estimating the object posture in the image to be processed based on the corrected key point information.
2. The method of claim 1, wherein prior to determining keypoint information in the image to be processed, further comprising:
acquiring an input image;
and converting the input image into an image with a preset image style to obtain the image to be processed.
3. The method according to claim 2, wherein said converting said input image into an image of a predetermined image style, resulting in said image to be processed, comprises:
extracting image content characteristics of the input image through a content backbone network;
Acquiring preset image style characteristics;
fusing the image content characteristics and the image style characteristics through a bilateral network to obtain fused characteristics;
and rendering the input image based on the fusion characteristics by a renderer to obtain the image to be processed.
4. A method according to claim 3, wherein the image content features are low resolution image content features; the image style feature is a low resolution image style feature.
5. The method according to any one of claims 1-4, wherein determining corrected keypoint information based on the keypoint feature map corresponding to the keypoint information comprises:
performing key point offset regression on the key point feature map corresponding to the key point information to obtain key point offset residual information;
and obtaining the corrected key point information based on the key point offset residual information and the key point information.
6. The method of any one of claims 1-5, wherein the keypoint information is 2D keypoint information, the method further comprising:
the following operations are repeatedly performed until a stop condition is satisfied:
Determining 3D key point mapping information of an object in the image to be processed based on the estimated object gesture in the image to be processed;
obtaining updated key point information based on the 3D key point mapping information;
re-determining the corrected key point information based on the updated key point information, and estimating the object posture in the image to be processed based on the re-determined corrected key point information;
wherein the stop condition includes at least one of:
the difference value before and after the key point correction is smaller than a threshold value;
the key point correction reaches a predetermined number of times.
7. The method of claim 6, wherein the determining 3D keypoint mapping information for the object in the image to be processed based on the estimated pose of the object in the image to be processed comprises:
and determining 3D key point mapping information of the object in the image to be processed based on the estimated object gesture in the image to be processed, a preset 3D model set and a camera internal matrix.
8. The method of claim 6, wherein the difference before and after keypoint correction is less than a threshold comprises at least one of:
the sum of the differences before and after the correction of the at least one key point is smaller than a threshold value;
The difference before and after correction for each of the at least one keypoint is less than a threshold.
9. The method according to any one of claims 1-8, wherein determining keypoint information in the image to be processed comprises:
determining key point coordinate information and key point visible information in the image to be processed through a detector network;
and determining the key point information based on the key point coordinate information and the key point visible information.
10. The method as recited in claim 9, further comprising:
obtaining a RoI feature map of the region of interest of the image to be processed through the detector network;
wherein before determining corrected key point information based on the key point feature map corresponding to the key point information, the method further comprises:
and determining a corresponding key point feature map in the RoI feature map based on the key point information.
11. The method as recited in claim 9, further comprising:
when at least two objects are in the image to be processed, determining the object type in the image to be processed through the detector network;
wherein before determining corrected key point information based on the key point feature map corresponding to the key point information, the method further comprises:
And determining a corresponding key point feature map based on the object category and the key point information.
12. An object pose estimation method, comprising:
converting the input image into an image to be processed of a predetermined image style;
determining key point information in an image to be processed;
and estimating the object gesture in the input image based on the key point information.
13. An object posture estimation apparatus, characterized by comprising:
the determining module is used for determining key point information in the image to be processed;
the correction module is used for determining corrected key point information based on the key point feature map corresponding to the key point information;
and the estimation module is used for estimating the object posture in the image to be processed based on the corrected key point information.
14. An object posture estimation apparatus, characterized by comprising:
the style conversion module is used for converting the input image into an image to be processed with a preset image style;
the key point determining module is used for determining key point information in the image to be processed;
and the object posture estimation module is used for estimating the object posture in the input image based on the key point information.
15. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1-12.
16. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any of claims 1-12.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111460674.6A CN116228850A (en) | 2021-12-02 | 2021-12-02 | Object posture estimation method, device, electronic equipment and readable storage medium |
KR1020220129712A KR20230083212A (en) | 2021-12-02 | 2022-10-11 | Apparatus and method for estimating object posture |
EP22208523.5A EP4191526A1 (en) | 2021-12-02 | 2022-11-21 | Apparatus and method with object posture estimating |
JP2022189728A JP2023082681A (en) | 2021-12-02 | 2022-11-29 | Object posture estimation device and method |
US18/072,974 US20230177722A1 (en) | 2021-12-02 | 2022-12-01 | Apparatus and method with object posture estimating |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111460674.6A CN116228850A (en) | 2021-12-02 | 2021-12-02 | Object posture estimation method, device, electronic equipment and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116228850A true CN116228850A (en) | 2023-06-06 |
Family
ID=86571769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111460674.6A Pending CN116228850A (en) | 2021-12-02 | 2021-12-02 | Object posture estimation method, device, electronic equipment and readable storage medium |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20230083212A (en) |
CN (1) | CN116228850A (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116824631B (en) * | 2023-06-14 | 2024-02-27 | Southwest Jiaotong University | Attitude estimation method and system |
- 2021-12-02 CN CN202111460674.6A patent/CN116228850A/en active Pending
- 2022-10-11 KR KR1020220129712A patent/KR20230083212A/en unknown
Also Published As
Publication number | Publication date |
---|---|
KR20230083212A (en) | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Monocular real-time volumetric performance capture | |
CN111428586B (en) | Three-dimensional human body posture estimation method based on feature fusion and sample enhancement | |
JP7236545B2 (en) | Video target tracking method and apparatus, computer apparatus, program | |
US11455712B2 (en) | Method and apparatus for enhancing stereo vision | |
CN111738265B (en) | Semantic segmentation method, system, medium, and electronic device for RGB-D image | |
CN114663502A (en) | Object posture estimation and image processing method and related equipment | |
KR102311796B1 (en) | Method and Apparatus for Deblurring of Human Motion using Localized Body Prior | |
Alperovich et al. | A variational model for intrinsic light field decomposition | |
CN109685095B (en) | Classifying 2D images according to 3D arrangement type | |
CN113392831A (en) | Analyzing objects in a set of frames | |
US20230281830A1 (en) | Optical flow techniques and systems for accurate identification and tracking of moving objects | |
Nousias et al. | A saliency aware CNN-based 3D model simplification and compression framework for remote inspection of heritage sites | |
Zhang et al. | Video extrapolation in space and time | |
CN116228850A (en) | Object posture estimation method, device, electronic equipment and readable storage medium | |
Pan et al. | An automatic 2D to 3D video conversion approach based on RGB-D images | |
CN111445573A (en) | Human hand modeling method, system, chip, electronic device and medium | |
CN115984583B (en) | Data processing method, apparatus, computer device, storage medium, and program product | |
JP2023082681A (en) | Object posture estimation device and method | |
Feng et al. | Unsupervised Monocular Depth Prediction for Indoor Continuous Video Streams | |
Lee et al. | Real-time Object Segmentation based on GPU | |
Nguyen et al. | Processing the 3D Heritage Data Samples Based on Combination of GNN and GAN | |
Nag et al. | Multi-Stage Edge Detection for Generative Spatial Robotic Artwork | |
Xu et al. | Pose Estimation of Texture-Less Targets for Unconstrained Grasping | |
Camplani et al. | Real-time RGB-D data processing on GPU architecture | |
Saadat | Exact Blur Measure Outperforms Conventional Learned Features for Depth Finding |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |