CN115471647A - Pose estimation method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN115471647A
Authority
CN
China
Prior art keywords
image
pose
information
coordinate
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210975426.3A
Other languages
Chinese (zh)
Inventor
孙明珊
郑烨
暴天鹏
陈建秋
金国强
吴立威
赵瑞
蒋小可
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202210975426.3A
Publication of CN115471647A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/24: Aligning, centring, orientation detection or correction of the image
    • G06V10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V10/30: Noise filtering
    • G06V10/40: Extraction of image or video features
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a pose estimation method and device, an electronic device and a storage medium. A target image including an object to be detected is acquired and input into a trained pose recognition model, and the target region where the object to be detected is located is extracted from the target image to obtain a first image and a corresponding mask image. External noise reduction is then performed on the first image according to the mask image to obtain a second image, and internal noise reduction and pose recognition are performed on the second image to obtain the pose information of the object to be detected. During pose estimation, the pose recognition model thus applies two noise reduction passes, external and internal, to the object to be detected in the target image, eliminating the influence of external and internal noise on pose recognition and improving the accuracy of the object's pose information.

Description

Pose estimation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision technologies, and in particular, to a pose estimation method and apparatus, an electronic device, and a storage medium.
Background
6D pose estimation is an important problem in computer vision research and in industrial manufacturing applications. Its application scenarios include intelligent robot grasping, autonomous driving, augmented reality and the like. Pose estimation of an object is usually performed by acquiring a target image with depth information, but physical noise is inevitably introduced when the depth information is acquired. Meanwhile, when the pose of an object needs to be estimated, corresponding noise is also introduced by the regions other than the object.
Disclosure of Invention
The disclosure provides a pose estimation method and device, electronic equipment and a storage medium, aiming at eliminating noise interference in a pose estimation process.
According to a first aspect of the present disclosure, there is provided a pose estimation method including:
acquiring a target image including an object to be detected;
inputting the target image into a pose recognition model, and extracting a target area where an object to be detected is located in the target image through the pose recognition model to obtain a first image and a mask image corresponding to the first image;
performing external noise reduction on the first image according to the mask image to obtain a second image;
and carrying out internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
In a possible implementation manner, the inputting the target image into a pose recognition model, and extracting a target region where an object to be detected is located in the target image through the pose recognition model includes:
inputting the target image into a pose recognition model, and recognizing, through the pose recognition model, the detection frame position and the corresponding segmentation mask of the object to be detected in the target image;
cutting the target image according to the position of the detection frame to obtain a first image;
and determining a mask image corresponding to the first image according to the segmentation mask.
In a possible implementation manner, the performing, according to the mask image, external noise reduction on the first image to obtain a second image includes:
determining a coordinate feature image and a normal vector feature image of the first image;
determining an image to be denoised according to the first image, the coordinate feature image and the normal vector feature image;
and performing dot multiplication on the mask image and the image to be denoised to obtain a second image.
In one possible implementation, the determining the coordinate feature image and the normal vector feature image of the first image includes:
calculating coordinate information of the object in the first image to obtain a first coordinate image;
performing coordinate back projection on the first coordinate image to obtain a second coordinate image, and determining the first coordinate image and the second coordinate image as coordinate feature images;
and calculating a normal vector of a depth channel in the first image to obtain a normal vector characteristic image.
In a possible implementation manner, the determining, according to the first image, the coordinate feature image, and the normal vector feature image, an image to be noise-reduced includes:
performing channel splicing on the first image, the coordinate feature image and the normal vector feature image to obtain a spliced image;
and carrying out position coding on the spliced image to obtain an image to be subjected to noise reduction.
In a possible implementation manner, the performing position coding on the spliced image to obtain an image to be noise-reduced includes:
carrying out position coding on at least one image channel in the spliced image to obtain a coded image;
and adding the at least one image channel in the spliced image and the corresponding image channel in the coded image to obtain an image to be subjected to noise reduction.
In one possible implementation manner, the performing internal noise reduction and pose identification on the second image includes:
performing feature extraction on the second image to obtain feature extraction information;
performing internal noise reduction and pose identification on the feature extraction information to obtain the pose, the space coordinate information and the reprojection information of the object to be detected;
determining pose information including the pose, the spatial coordinate information, and the reprojection information.
In one possible implementation manner, the pose recognition model comprises an object extraction module for determining the second image and a pose recognition module for determining pose information, and the pose recognition module comprises a backbone network layer for feature extraction and a depth denoising layer for pose recognition and internal denoising;
the training process of the pose recognition module comprises the following steps:
determining at least one sample first image and corresponding annotation reprojection information;
taking the sample first image as the input of the pose recognition module, and determining corresponding predicted pose information, wherein the predicted pose information comprises a predicted pose, predicted coordinates and predicted reprojection information;
and adjusting the depth denoising layer in the pose recognition module according to the difference between the annotation reprojection information and the predicted reprojection information of the at least one sample first image.
According to a second aspect of the present disclosure, there is provided a pose estimation apparatus including:
the image determining module is used for acquiring a target image comprising an object to be detected;
the region extraction module is used for inputting the target image into a pose recognition model, extracting a target region where an object to be detected in the target image is located through the pose recognition model, and obtaining a first image and a mask image corresponding to the first image;
the noise reduction module is used for carrying out external noise reduction on the first image according to the mask image to obtain a second image;
and the pose estimation module is used for carrying out internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
In one possible implementation manner, the region extraction module includes:
the object identification submodule is used for inputting the target image into a pose identification model, and identifying the position of a detection frame corresponding to the object to be detected in the target image and a corresponding segmentation mask through the pose identification model;
the image cutting submodule is used for cutting the target image according to the position of the detection frame to obtain a first image;
and the mask determining submodule is used for determining a mask image corresponding to the first image according to the segmentation mask.
In one possible implementation, the noise reduction module includes:
the characteristic image determining submodule is used for determining a coordinate characteristic image and a normal vector characteristic image of the first image;
the image to be subjected to noise reduction determining submodule is used for determining an image to be subjected to noise reduction according to the first image, the coordinate feature image and the normal vector feature image;
and the external noise reduction submodule is used for performing dot multiplication on the mask image and the image to be subjected to noise reduction to obtain a second image.
In one possible implementation, the feature image determination sub-module includes:
the first coordinate determination unit is used for calculating coordinate information of an object in the first image to obtain a first coordinate image;
the second coordinate determination unit is used for carrying out coordinate back projection on the first coordinate image to obtain a second coordinate image and determining the first coordinate image and the second coordinate image as coordinate feature images;
and the normal vector feature determination unit is used for calculating a normal vector of the depth channel in the first image to obtain a normal vector feature image.
In a possible implementation manner, the image to be noise-reduced determining sub-module includes:
the channel splicing unit is used for carrying out channel splicing on the first image, the coordinate characteristic image and the normal vector characteristic image to obtain a spliced image;
and the position coding unit is used for carrying out position coding on the spliced image to obtain an image to be denoised.
In one possible implementation manner, the position encoding unit includes:
the channel coding subunit is used for carrying out position coding on at least one image channel in the spliced image to obtain a coded image;
and the channel adding subunit is used for adding the at least one image channel in the spliced image with the corresponding image channel in the coded image to obtain an image to be subjected to noise reduction.
In one possible implementation, the pose estimation module includes:
the feature extraction sub-module is used for extracting features of the second image to obtain feature extraction information;
the internal noise reduction sub-module is used for carrying out internal noise reduction and pose identification on the feature extraction information to obtain the pose, the space coordinate information and the reprojection information of the object to be detected;
and the pose information determining submodule is used for determining pose information comprising the pose, the space coordinate information and the reprojection information.
In one possible implementation manner, the pose recognition model comprises an object extraction module for determining the second image and a pose recognition module for determining pose information, and the pose recognition module comprises a backbone network layer for feature extraction and a depth denoising layer for pose recognition and internal denoising;
the training process of the pose recognition module comprises the following steps:
determining at least one sample first image and corresponding annotation reprojection information;
taking the sample first image as the input of the pose recognition module, and determining corresponding predicted pose information, wherein the predicted pose information comprises a predicted pose, predicted coordinates and predicted reprojection information;
and adjusting the depth denoising layer in the pose recognition module according to the difference between the annotation reprojection information and the predicted reprojection information of the at least one sample first image.
According to a third aspect of the present disclosure, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
According to a fourth aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
In the embodiments of the present disclosure, a target image including an object to be detected is acquired and input into a trained pose recognition model, and the target region where the object to be detected is located is extracted from the target image to obtain a first image and a corresponding mask image. External noise reduction is then performed on the first image according to the mask image to obtain a second image, and internal noise reduction and pose recognition are performed on the second image to obtain the pose information of the object to be detected. With the pose estimation method, during pose estimation the pose recognition model applies two noise reduction passes, external and internal, to the object to be detected in the target image, eliminating the influence of external and internal noise on pose recognition and improving the accuracy of the object's pose information.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow chart of a pose estimation method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic structural diagram of a pose identification model according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a process of determining a second image according to an embodiment of the disclosure;
FIG. 4 illustrates a schematic diagram of a process of determining pose information in accordance with an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of a pose identification process in accordance with an embodiment of the present disclosure;
fig. 6 shows a schematic diagram of a pose estimation apparatus according to an embodiment of the present disclosure;
FIG. 7 shows a schematic diagram of an electronic device according to an embodiment of the disclosure;
fig. 8 shows a schematic diagram of another electronic device according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of a, B, and C, and may mean including any one or more elements selected from the group consisting of a, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
In a possible implementation manner, the pose estimation method according to the embodiment of the present disclosure may be executed by an electronic device such as a terminal device or a server. The terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or other fixed or mobile terminal devices, and the server may be a single server or a server cluster formed by multiple servers. The electronic device may implement the pose estimation method by way of a processor invoking computer readable instructions stored in a memory. Optionally, the embodiments of the present disclosure may be applied to an application scenario in which pose estimation is performed on any object.
Fig. 1 shows a flowchart of a pose estimation method according to an embodiment of the present disclosure, and as shown in fig. 1, the pose estimation method of an embodiment of the present disclosure may include the following steps S10 to S40.
And S10, acquiring a target image comprising an object to be detected.
In a possible implementation manner, the electronic device acquires a target image corresponding to an object to be detected whose pose needs to be estimated. The electronic device may capture the target image directly through a built-in or connected image acquisition device, or may receive the target image after another electronic device has captured and uploaded it. The object to be detected may be any object requiring pose estimation, including animate objects such as people, cats and dogs, and inanimate objects such as cups, furniture, vehicles and buildings.
Alternatively, in order to improve the accuracy of the pose estimation result, the target image may be an image including both a color channel and a depth channel, that is, including both color feature information and depth feature information. For example, the target image may be an RGB-D image including three color channels for characterizing color characteristics of each pixel in the target image and one depth channel for characterizing depth characteristics of each pixel in the target image.
Step S20, inputting the target image into a pose recognition model, and extracting a target area where an object to be detected is located in the target image through the pose recognition model to obtain a first image and a mask image corresponding to the first image.
In a possible implementation manner, after the electronic device determines a target image including an object to be detected, the electronic device may perform pose recognition on the target image through a pre-trained pose recognition model to obtain pose information of the object to be detected. In the pose recognition process, the target image may be processed in two successive stages, with one noise reduction operation performed in each stage, so as to obtain pose information free of noise interference. Optionally, the noise reduction operation performed in the first stage is external noise reduction, which removes noise introduced by the background region other than the object to be detected in the target image. The noise reduction operation performed in the second stage is internal noise reduction, which removes the internal noise introduced by the depth information of the target region of the target image.
Optionally, the first-stage processing of the pose recognition model may include extracting the target region where the object to be detected is located from the target image to obtain a first image and a mask image corresponding to the first image. The first image may be determined by cutting out the target region where the object to be detected is located in the target image, and the mask image may represent the position of the object to be detected in the first image. For example, the mask image may be a binary image composed of 0 and 1, where the pixel values at the position of the object to be detected are set to 1 and the pixel values elsewhere are set to 0.
Further, the pose recognition model may determine the first image and the corresponding mask image by way of object recognition. That is, the pose recognition model may first recognize the detection frame position and the segmentation mask corresponding to the object to be detected in the target image, then cut the target image according to the detection frame position to obtain the first image, and determine the mask image corresponding to the first image according to the segmentation mask. The detection frame position and the segmentation mask are both determined according to the position of the object to be detected. The detection frame may be a box containing the object to be detected and the background information around it. The segmentation mask is used to separate the background information from the object to be detected, and may represent the object to be detected and all background information other than the object in the target image by 1 and 0, respectively. The electronic device may directly cut the target image according to the detection frame position to obtain the first image, determine the segmentation mask image corresponding to the target image according to the segmentation mask, and then cut the segmentation mask image according to the detection frame position to obtain the mask image corresponding to the first image.
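To make the cropping step above concrete, here is a minimal sketch of how the first image and the mask image could be derived from a detection frame position and a segmentation mask; the function name, array layout and box convention are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def extract_target_region(target_image, box, seg_mask):
    """Crop the region of the object to be detected and align its mask.

    target_image: (H, W, 4) RGB-D array (three color channels + depth);
    box: (x0, y0, x1, y1) detection frame position from the model;
    seg_mask: (H, W) binary segmentation mask (1 = object, 0 = background).
    """
    x0, y0, x1, y1 = box
    first_image = target_image[y0:y1, x0:x1]   # first image: cut by the detection frame
    mask_image = seg_mask[y0:y1, x0:x1]        # mask image: same cut of the segmentation mask
    return first_image, mask_image
```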
Optionally, the pose recognition model may include two modules configured to sequentially perform the two-stage processing on the target image, with one noise reduction operation performed in each stage, so as to obtain pose information free of noise interference. For example, the pose recognition model may include an object extraction module and a pose recognition module, where the object extraction module performs the first-stage processing and the pose recognition module performs the second-stage processing. The object extraction module and the pose recognition module may each further include at least one processing layer for performing different operations. The noise reduction operation performed in the first stage is external noise reduction, which removes noise introduced by the background region other than the object to be detected in the target image. The noise reduction operation performed in the second stage is internal noise reduction, which removes the internal noise introduced by the depth information of the target region of the target image.
Fig. 2 shows a schematic structural diagram of a pose recognition model according to an embodiment of the present disclosure. As shown in fig. 2, the pose recognition model 20 includes an object extraction module 21 and a pose recognition module 22 for performing two noise reduction operations, i.e., external noise reduction and internal noise reduction, respectively. The pose recognition module 22 may further include two backbone network layers 23 and a depth denoising layer 24 for performing different data processing.
Optionally, in the case that the pose recognition model includes an object extraction module, after the target image is input into the pose recognition model, the electronic device first performs a first stage of processing on the target image through the object extraction module in the pose recognition model, that is, extracts a target area where the object to be detected is located from the target image, and obtains the first image and a mask image corresponding to the first image.
Further, the object extraction module may determine the first image and the mask image by recognizing the detection frame position and the segmentation mask corresponding to the object to be detected in the target image. That is, the target image is input into the trained pose recognition model, and the detection frame position and the corresponding segmentation mask of the object to be detected in the target image are recognized through the object extraction module in the pose recognition model. The target image is then cut according to the detection frame position to obtain the first image, and the mask image corresponding to the first image is determined according to the segmentation mask. The electronic device may directly cut the target image according to the detection frame position to obtain the first image, determine the segmentation mask image corresponding to the target image according to the segmentation mask, and then cut the segmentation mask image according to the detection frame position to obtain the mask image corresponding to the first image.
And S30, performing external noise reduction on the first image according to the mask image to obtain a second image.
In a possible implementation manner, the first-stage processing performed on the target image by the pose recognition model further includes performing external noise reduction on the first image to obtain a second image. That is, after the first image and the mask image corresponding to the target image are determined, external noise reduction may be performed on the first image according to the mask image to obtain a second image from which external noise has been removed. The external noise in the first image is the noise introduced by the background region outside the region where the object to be detected is located; that is, external noise reduction targets the noise introduced by the region other than the object to be detected. In the case where the pose recognition model includes the object extraction module, the external noise reduction may be performed on the first image by the object extraction module.
Optionally, in order to improve the accuracy of the external noise reduction effect, the coordinate feature and the normal vector feature of the first image may be extracted, and then the noise reduction operation may be performed based on the coordinate feature and the normal vector feature to obtain the second image. Namely, the external noise reduction process may include determining a coordinate feature image and a normal vector feature image of the first image, determining an image to be noise reduced according to the first image, the coordinate feature image and the normal vector feature image, and performing point multiplication on the mask image and the image to be noise reduced to obtain a second image. The coordinate feature image is used for representing the coordinate features of the object to be detected in the first image, and the normal vector features are determined according to the depth channel of the first image and used for representing the form of the object to be detected in the first image.
Optionally, the coordinate feature image may include two different images. The process of determining the coordinate feature image and the normal vector feature image corresponding to the first image may include calculating coordinate information of the object in the first image to obtain a first coordinate image, performing coordinate back-projection on the first coordinate image to obtain a second coordinate image, and determining the first coordinate image and the second coordinate image as the coordinate feature images. A normal vector of the depth channel in the first image is then calculated to obtain the normal vector feature image. The first coordinate image may be an image determined according to the two-dimensional UV coordinates of the first image; it has the same size as the first image and includes a U channel storing the abscissa of each pixel point and a V channel storing the ordinate of each pixel point. The second coordinate image may be obtained by back-projecting the first coordinate image according to the intrinsic matrix of the camera that captured the target image and the corresponding depth map, and includes two channels, X and Y. The depth map may be the depth channel of the first image. The normal vector feature image may be determined by directly calculating the normal vector of the depth channel in the first image, and includes three channels.
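As one plausible realization of the feature images described above, the sketch below builds the UV coordinate image, back-projects it through a pinhole intrinsic matrix K into an XY coordinate image, and estimates normal vectors from depth gradients; the pinhole model and the gradient-based normal estimate are assumptions, since the disclosure does not fix the exact operators.

```python
import numpy as np

def coordinate_and_normal_features(first_image, K):
    """first_image: (H, W, 4) RGB-D crop; K: 3x3 camera intrinsic matrix."""
    h, w = first_image.shape[:2]
    depth = first_image[..., 3]
    v, u = np.mgrid[0:h, 0:w].astype(np.float32)
    first_coord = np.stack([u, v], axis=-1)        # first coordinate image: U and V channels
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) * depth / fx                      # pinhole back-projection
    y = (v - cy) * depth / fy
    second_coord = np.stack([x, y], axis=-1)       # second coordinate image: X and Y channels
    # Normals from depth gradients: n is proportional to (-dz/du, -dz/dv, 1), normalized.
    dz_dv, dz_du = np.gradient(depth)
    normals = np.dstack([-dz_du, -dz_dv, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=-1, keepdims=True) + 1e-8
    return first_coord, second_coord, normals
```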
In a possible implementation manner, the image to be denoised corresponding to the first image may be determined through channel splicing and encoding. That is, after the coordinate feature image and the normal vector feature image corresponding to the first image are determined, the first image, the coordinate feature image and the normal vector feature image are subjected to channel splicing to obtain a spliced image, and the spliced image is then position-encoded to obtain the image to be denoised. The spliced image comprises 11 channels: 4 channels from the first image, 2 channels from the first coordinate image, 2 channels from the second coordinate image and 3 channels from the normal vector feature image.
Optionally, the process of performing position encoding on the spliced image may include performing position encoding on the channels included in the spliced image to obtain an encoded image, and adding the channels of the spliced image to the corresponding channels of the encoded image to obtain the image to be denoised. For example, the UV coordinates corresponding to each channel in the spliced image are determined, the UV coordinates are encoded through a trigonometric function to obtain an encoded channel, and the position encoding result of each channel is added to the corresponding channel in the spliced image to obtain the image to be denoised. After the image to be denoised is obtained, external noise reduction is realized by calculating the dot product of the image to be denoised and the mask image, yielding the second image.
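Continuing the sketch, channel splicing, a trigonometric position code and the masked dot product might be combined as follows. The specific sinusoidal form is an assumption; the disclosure only states that UV coordinates are encoded through a trigonometric function and added channel-wise.

```python
import numpy as np

def external_denoise(first_image, first_coord, second_coord, normals, mask_image):
    """Splice 4 + 2 + 2 + 3 = 11 channels, add a position code, and mask."""
    spliced = np.dstack([first_image, first_coord, second_coord, normals])  # (H, W, 11)
    h, w, c = spliced.shape
    v, u = np.mgrid[0:h, 0:w].astype(np.float32)
    freqs = np.arange(1, c + 1, dtype=np.float32)      # one frequency per channel
    # Trigonometric encoding of the UV coordinates, added channel-wise.
    pos_code = np.sin(u[..., None] / w * freqs) + np.cos(v[..., None] / h * freqs)
    to_denoise = spliced + pos_code                    # image to be denoised
    second_image = to_denoise * mask_image[..., None]  # dot product with the mask image
    return second_image
```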
FIG. 3 illustrates a schematic diagram of a process of determining a second image according to an embodiment of the disclosure. As shown in fig. 3, after the electronic device determines a target image 30, an object extraction module 31 extracts the target region where the object to be detected is located in the target image to obtain a first image 32 and a corresponding mask image 33. After the coordinate feature image 34 and the normal vector feature image 35 corresponding to the first image 32 are determined, the first image 32, the coordinate feature image 34 and the normal vector feature image 35 are channel-spliced to determine an image 36 to be denoised. The dot product of the mask image 33 and the image 36 to be denoised is then calculated to realize external noise reduction, obtaining a second image 37 from which the external noise introduced by the background region has been removed.
And S40, carrying out internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
In a possible implementation manner, after object extraction and external noise reduction are performed on the target image to obtain the second image, the second-stage processing can be further executed through the pose recognition model; that is, internal noise reduction and pose recognition are performed on the second image to obtain accurate pose information of the object to be detected. The process of internal noise reduction and pose recognition may include performing feature extraction on the second image to obtain feature extraction information, then performing internal noise reduction and pose recognition on the feature extraction information to obtain the pose, spatial coordinate information and reprojection information of the object to be detected, and determining pose information including the pose, the spatial coordinate information and the reprojection information. Optionally, the internal noise is the noise arising from deviations in the depth information of the object to be detected; that is, internal noise reduction targets the noise inside the object to be detected.
Optionally, in the case where the pose recognition model includes a pose recognition module, and the pose recognition module further includes a backbone network layer and a depth denoising layer for performing different processing, the pose recognition module is configured to perform the second stage of the pose estimation process: feature extraction is performed through the backbone network layer, and internal noise reduction and pose recognition are then performed through the depth denoising layer to obtain the pose information. The embodiments of the present disclosure can add the internal noise reduction capability to the pose recognition module by introducing spatial coordinate information and reprojection coordinate information constraints during training of the pose recognition module. The parameters of the pose recognition module may be adjusted during training according to a pose constraint, a spatial coordinate constraint and a reprojection coordinate constraint, where the backbone network layer and the depth denoising layer are adjusted through the pose constraint and the spatial coordinate constraint, and the parameters of the depth denoising layer are adjusted through the reprojection coordinate constraint.
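A hedged PyTorch-style sketch of this constraint scheme follows: the pose and spatial coordinate losses update both the backbone network layer and the depth denoising layer, while the reprojection loss is computed on detached features so that its gradient reaches the depth denoising layer only. The module interfaces, batch keys and the L1 loss choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def training_step(backbone, denoise_layer, optimizer, batch):
    """One training step under the three constraints (interfaces assumed)."""
    feats = backbone(batch["second_image"])
    pose, coords, _ = denoise_layer(feats)            # pose/coordinate constraints: full gradients
    _, _, reproj = denoise_layer(feats.detach())      # reprojection constraint: denoising layer only

    loss = (F.l1_loss(pose, batch["gt_pose"])         # pose constraint
            + F.l1_loss(coords, batch["gt_coords"])   # spatial coordinate constraint
            + F.l1_loss(reproj, batch["gt_reproj"]))  # reprojection coordinate constraint

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Detaching the features is one simple way to route a single loss term to a subset of parameters without maintaining a second optimizer.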
Optionally, the training process of the pose recognition module may include determining at least one sample first image and corresponding annotation reprojection information, taking the sample first image as the input of the pose recognition module, and determining corresponding predicted pose information, where the predicted pose information includes a predicted pose, predicted coordinates and predicted reprojection information, and then adjusting the depth denoising layer in the pose recognition module according to the difference between the annotation reprojection information and the predicted reprojection information of the at least one sample first image. The sample first image is a first image without external noise. The annotation reprojection information may be determined from the annotation pose information of the sample first image; that is, at least one sample first image and corresponding annotation pose information may be determined, and the annotation reprojection information is then determined according to the annotation pose information.
Alternatively, the annotation pose information may be manually annotated in advance or determined automatically. For example, the annotated object may be modeled by CAD (Computer Aided Design), its pose may be transformed by rotating and translating the object multiple times, and a corresponding annotated first image may be acquired after each pose transformation. After each pose transformation, the annotation pose information is determined according to the rotation matrix and the true translation of the object. The annotation reprojection information is then determined according to the rotation R and the translation T in the annotation pose information and the coordinates (a, b, c) of each pixel point in the model coordinate system, yielding the coordinates (x, y, d) in the camera coordinate system. The reprojection information includes the coordinates of each pixel point in the camera coordinate system and the normal vectors calculated from the reprojected depth d; it may consist of six channels: an abscissa x channel, an ordinate y channel, a depth d channel and three normal vector channels. The coordinates (x, y, d) in the reprojection information may be determined by the following formula:
(x, y, d)^T = R · (a, b, c)^T + T
in one possible implementation, after the sample first image and the annotation reprojection information are determined, the depth denoising layer may be adjusted according to a difference between the annotation reprojection information and the prediction reprojection information. Optionally, an annotation coordinate may be determined according to the annotation pose information of the first image of each sample, and then the backbone network layer and the depth denoising layer in the position identification model are adjusted according to the difference between the annotation pose information and the predicted pose information and the difference between the annotation coordinate and the predicted coordinate.
Fig. 4 shows a schematic diagram of a process of determining pose information according to an embodiment of the disclosure. As shown in fig. 4, the second image 40 is input to the backbone network layer 41 in the pose recognition module to obtain feature extraction information 42. The feature extraction information 42 is then input into a depth denoising layer 43, which removes internal noise and performs pose estimation to obtain the corresponding object pose 44, spatial coordinate information 45 and reprojection information 46 as accurate pose information.
Fig. 5 shows a schematic diagram of a pose recognition process according to an embodiment of the present disclosure. As shown in fig. 5, the pose recognition process of the embodiment of the disclosure includes two stages. The first stage performs object extraction and external noise reduction on the target image through the object extraction module in the pose recognition model to obtain the second image. The second stage extracts feature extraction information through the backbone network layer, and then performs internal noise reduction and pose estimation through the depth denoising layer to obtain the accurate object pose, spatial coordinate information and reprojection information as accurate pose information.
Based on the above features, the embodiments of the present disclosure can automatically remove two kinds of noise from the target image with a single pose recognition model, eliminating noise interference to obtain accurate pose information of the object to be detected. By determining the mask image, the external noise introduced by the background region is effectively filtered within the pose recognition model, improving the accuracy of external noise reduction. By introducing the reprojection information constraint during training, the noise introduced by the depth information is automatically filtered during pose estimation, improving the accuracy of internal noise reduction. By accurately and effectively filtering both external and internal noise, the embodiments of the present disclosure further improve the accuracy of the pose information result.
It is understood that the above-mentioned method embodiments of the present disclosure can be combined with each other to form combined embodiments without departing from the principles and logic; details are omitted here for brevity. Those skilled in the art will also appreciate that, in the above methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible inherent logic.
In addition, the present disclosure also provides a pose estimation apparatus, an electronic device, a computer-readable storage medium, and a program, which can be used to implement any one of the pose estimation methods provided by the present disclosure, and the corresponding technical solutions and descriptions and corresponding descriptions in the methods section are omitted for brevity.
Fig. 6 shows a schematic diagram of a pose estimation apparatus according to an embodiment of the present disclosure, and as shown in fig. 6, the pose estimation apparatus of an embodiment of the present disclosure may include:
an image determination module 60, configured to obtain a target image including an object to be detected;
the region extraction module 61 is configured to input the target image into a pose recognition model, extract a target region where an object to be detected in the target image is located through the pose recognition model, and obtain a first image and a mask image corresponding to the first image;
a noise reduction module 62, configured to perform external noise reduction on the first image according to the mask image to obtain a second image;
and the pose estimation module 63 is configured to perform internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
In a possible implementation manner, the region extraction module 61 includes:
the object identification sub-module is used for inputting the target image into a pose identification model, and identifying, through the pose identification model, the detection frame position and the corresponding segmentation mask of the object to be detected in the target image;
the image cutting submodule is used for cutting the target image according to the position of the detection frame to obtain a first image;
and the mask determining submodule is used for determining a mask image corresponding to the first image according to the segmentation mask.
In one possible implementation, the noise reduction module 62 includes:
the characteristic image determining submodule is used for determining a coordinate characteristic image and a normal vector characteristic image of the first image;
the image to be denoised determining submodule is used for determining an image to be denoised according to the first image, the coordinate feature image and the normal vector feature image;
and the external noise reduction submodule is used for performing dot multiplication on the mask image and the image to be subjected to noise reduction to obtain a second image.
In one possible implementation, the feature image determination sub-module includes:
the first coordinate determination unit is used for calculating coordinate information of an object in the first image to obtain a first coordinate image;
the second coordinate determination unit is used for carrying out coordinate back projection on the first coordinate image to obtain a second coordinate image and determining the first coordinate image and the second coordinate image as coordinate feature images;
and the normal vector characteristic determining unit is used for calculating a normal vector of the depth channel in the first image to obtain a normal vector characteristic image.
In a possible implementation manner, the image to be denoised determining submodule includes:
the channel splicing unit is used for carrying out channel splicing on the first image, the coordinate characteristic image and the normal vector characteristic image to obtain a spliced image;
and the position coding unit is used for carrying out position coding on the spliced image to obtain an image to be subjected to noise reduction.
In a possible implementation manner, the position encoding unit includes:
the channel coding subunit is used for carrying out position coding on at least one image channel in the spliced image to obtain a coded image;
and the channel adding subunit is used for adding the at least one image channel in the spliced image and the corresponding image channel in the coded image to obtain an image to be subjected to noise reduction.
In one possible implementation, the pose estimation module 63 includes:
the characteristic extraction submodule is used for extracting the characteristics of the second image to obtain characteristic extraction information;
the internal noise reduction sub-module is used for carrying out internal noise reduction and pose identification on the feature extraction information to obtain the pose, the space coordinate information and the reprojection information of the object to be detected;
and the pose information determining submodule is used for determining pose information comprising the pose, the space coordinate information and the reprojection information.
In one possible implementation manner, the pose recognition model comprises an object extraction module for determining the second image and a pose recognition module for determining pose information, and the pose recognition module comprises a backbone network layer for feature extraction and a depth denoising layer for pose recognition and internal denoising;
the training process of the pose recognition module comprises the following steps:
determining at least one sample first image and corresponding annotation reprojection information;
taking the sample first image as the input of the pose recognition module, and determining corresponding predicted pose information, wherein the predicted pose information comprises a predicted pose, predicted coordinates and predicted reprojection information;
and adjusting the depth denoising layer in the pose recognition module according to the difference between the annotation reprojection information and the predicted reprojection information of the at least one sample first image.
The above method is specifically and technically related to the internal structure of a computer system and can solve the technical problem of how to improve hardware operation efficiency or execution effect (including reducing data storage, reducing data transmission, increasing hardware processing speed and the like), thereby obtaining the technical effect of improving the internal performance of the computer system in accordance with the laws of nature.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile or non-volatile computer readable storage medium.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the memory-stored instructions to perform the above-described method.
The disclosed embodiments also provide a computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code, which when run in a processor of an electronic device, the processor in the electronic device performs the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 7 shows a schematic diagram of an electronic device 800 according to an embodiment of the disclosure. For example, the electronic device 800 may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or other terminal device.
Referring to fig. 7, electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge Coupled Device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as Wi-Fi, second-generation (2G), third-generation (3G), fourth-generation (4G), long-term evolution (LTE), or fifth-generation (5G) mobile communication technology, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra-Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application-Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field-Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
The disclosure relates to the field of augmented reality. By acquiring image information of a target object in a real environment and applying various vision-related algorithms to detect or identify relevant features, states, and attributes of the target object, an AR effect combining the virtual and the real, matched to a specific application, can be obtained. For example, the target object may be a face, limb, gesture, or action associated with a human body, or a marker or sign associated with an object, or a sand table, display area, or display item associated with a venue or place. The vision-related algorithms may involve visual localization, SLAM, three-dimensional reconstruction, image registration, background segmentation, key point extraction and tracking of objects, pose or depth detection of objects, and the like. The specific application may involve interactive scenarios related to a real scene or object, such as navigation, explanation, reconstruction, and superimposed display of virtual effects, and may also involve person-related special-effect processing, such as makeup beautification, body beautification, special-effect display, and virtual model display. The detection or identification of the relevant features, states, and attributes of the target object can be realized through a convolutional neural network, which is a network model obtained by model training based on a deep learning framework.
Fig. 8 shows a schematic diagram of another electronic device 1900 in accordance with an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server or terminal device. Referring to fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources, represented by a memory 1932, for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may further include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the Apple graphical-user-interface operating system (Mac OS X™), the multi-user, multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 1932, is also provided that includes computer program instructions executable by the processing component 1922 of the electronic device 1900 to perform the above-described methods.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), can be personalized with state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be embodied in hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a software development kit (SDK) or the like.
The foregoing descriptions of the various embodiments tend to emphasize the differences between them; for aspects that are the same or similar, the embodiments may be referenced with one another, and those aspects are not repeated herein for brevity.
It will be understood by those skilled in the art that, in the methods of the present disclosure, the order in which the steps are written does not imply a strict order of execution or impose any limitation on the implementation; the specific order of execution of the steps should be determined by their functions and possible inherent logic.
If the technical solution of the present application involves personal information, a product applying the technical solution clearly informs users of the personal information processing rules and obtains the individual's separate consent before processing any personal information. If the technical solution involves sensitive personal information, a product applying the technical solution obtains the individual's separate consent and additionally satisfies the requirement of "explicit consent" before processing any sensitive personal information. For example, at a personal information collection device such as a camera, a clear and prominent notice is set up to indicate that the device is within a personal information collection range and that personal information will be collected; if an individual voluntarily enters the collection range, the individual is deemed to have consented to the collection of his or her personal information. Alternatively, on a device that processes personal information, with the personal information processing rules communicated through a conspicuous notice, individual authorization is obtained by means such as a pop-up window or by asking the individual to upload his or her personal information. The personal information processing rules may include information such as the personal information processor, the purposes of processing, the processing methods, and the types of personal information to be processed.
Having described embodiments of the present disclosure, the foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A pose estimation method, characterized in that the method comprises:
acquiring a target image including an object to be detected;
inputting the target image into a pose recognition model, and extracting a target area where an object to be detected is located in the target image through the pose recognition model to obtain a first image and a mask image corresponding to the first image;
performing external noise reduction on the first image according to the mask image to obtain a second image;
and carrying out internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
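For illustration only, the method of claim 1 can be read as a three-stage pipeline. A minimal Python sketch follows; the model interface, function names, and tensor shapes are assumptions, not part of the claim:

```python
import torch

def estimate_pose(target_image: torch.Tensor, model) -> dict:
    """Hypothetical end-to-end pipeline for claim 1.

    target_image: image tensor of shape (C, H, W) containing the object
    to be detected; model: a pose recognition model assumed to expose
    the two stages used below.
    """
    # Stage 1: extract the target area, yielding the cropped first image
    # and the mask image corresponding to it.
    first_image, mask_image = model.extract_target_region(target_image)

    # Stage 2: external noise reduction -- zero out pixels outside the
    # mask to obtain the second image.
    second_image = first_image * mask_image

    # Stage 3: internal noise reduction and pose identification on the
    # cleaned crop, producing the pose information.
    return model.identify_pose(second_image)
```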
2. The method according to claim 1, wherein the inputting the target image into a pose recognition model and extracting, through the pose recognition model, a target area in which the object to be detected is located in the target image comprises:
inputting the target image into the pose recognition model, and performing recognition through the pose recognition model to obtain a detection frame position and a corresponding segmentation mask for the object to be detected in the target image;
cutting the target image according to the position of the detection frame to obtain a first image;
and determining a mask image corresponding to the first image according to the segmentation mask.
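A minimal sketch of the cropping step in claim 2, assuming the pose recognition model has already produced a detection frame position `box` and a full-image segmentation mask (both names hypothetical):

```python
import numpy as np

def crop_object(target_image: np.ndarray, box: tuple, seg_mask: np.ndarray):
    """Cut the target image by the detection frame position and align
    the segmentation mask to the crop."""
    x0, y0, x1, y1 = box                       # detection frame corners
    first_image = target_image[y0:y1, x0:x1]   # the first image
    mask_image = seg_mask[y0:y1, x0:x1]        # mask image for the crop
    return first_image, mask_image
```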
3. The method according to claim 1 or 2, wherein the performing external noise reduction on the first image according to the mask image to obtain a second image comprises:
determining a coordinate feature image and a normal vector feature image of the first image;
determining an image to be denoised according to the first image, the coordinate feature image and the normal vector feature image;
and performing dot multiplication on the mask image and the image to be denoised to obtain a second image.
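The external noise reduction of claim 3 amounts to an element-wise ("dot") multiplication between the mask and the stacked feature channels. A sketch under the assumption that all feature images share the crop's spatial size:

```python
import numpy as np

def external_denoise(first_image, coord_feat, normal_feat, mask_image):
    """Zero out background pixels of the image to be denoised."""
    # Stack appearance, coordinate, and normal-vector features into one
    # multi-channel image to be denoised (channel-last layout assumed).
    to_denoise = np.concatenate([first_image, coord_feat, normal_feat], axis=-1)

    # Broadcast the (H, W) mask across channels; pixels where the mask
    # is zero are suppressed, yielding the second image.
    return to_denoise * mask_image[..., None]
```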
4. The method of claim 3, wherein determining the coordinate feature image and the normal vector feature image of the first image comprises:
calculating coordinate information of the object in the first image to obtain a first coordinate image;
performing coordinate back projection on the first coordinate image to obtain a second coordinate image, and determining the first coordinate image and the second coordinate image as coordinate feature images;
and calculating a normal vector of a depth channel in the first image to obtain a normal vector characteristic image.
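One plausible reading of claim 4 is that the coordinate back projection lifts per-pixel depth into camera-space XYZ via the camera intrinsics, and that the normal vectors are approximated from depth gradients; the claim fixes neither operator, so the following is a sketch under those assumptions:

```python
import numpy as np

def backproject_depth(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Back-project a depth map into an (H, W, 3) coordinate image
    using pinhole intrinsics K."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T   # per-pixel viewing ray
    return rays * depth[..., None]       # scale each ray by its depth

def depth_normals(depth: np.ndarray) -> np.ndarray:
    """Estimate per-pixel normal vectors from depth gradients."""
    dzdx = np.gradient(depth, axis=1)    # horizontal depth change
    dzdy = np.gradient(depth, axis=0)    # vertical depth change
    normals = np.stack([-dzdx, -dzdy, np.ones_like(depth)], axis=-1)
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)
```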
5. The method according to claim 3 or 4, wherein the determining an image to be denoised according to the first image, the coordinate feature image and the normal vector feature image comprises:
performing channel splicing on the first image, the coordinate feature image and the normal vector feature image to obtain a spliced image;
and carrying out position coding on the spliced image to obtain an image to be denoised.
6. The method according to claim 5, wherein the performing position coding on the spliced image to obtain the image to be denoised comprises:
carrying out position coding on at least one image channel in the spliced image to obtain a coded image;
and adding the at least one image channel in the spliced image to the corresponding image channel in the coded image to obtain the image to be denoised.
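Claims 5 and 6 splice the feature channels and then add a position code back onto selected channels. The patent does not specify the code itself, so the sketch below assumes a simple sinusoidal map of pixel coordinates; the choice of channels and of encoding is hypothetical:

```python
import numpy as np

def add_position_encoding(spliced: np.ndarray, channels=(0, 1)) -> np.ndarray:
    """Position-code selected channels of the spliced image and add the
    coded image back, giving the image to be denoised."""
    h, w, _ = spliced.shape
    row_code = np.sin(np.arange(h) / h * np.pi)[:, None]  # (H, 1)
    col_code = np.sin(np.arange(w) / w * np.pi)[None, :]  # (1, W)
    code = row_code + col_code                            # (H, W) coded image

    out = spliced.copy()
    for c in channels:
        out[..., c] = spliced[..., c] + code  # channel-wise addition
    return out
```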
7. The method according to any one of claims 1-6, wherein the performing internal noise reduction and pose identification on the second image comprises:
performing feature extraction on the second image to obtain feature extraction information;
performing internal noise reduction and pose identification on the feature extraction information to obtain the pose, the spatial coordinate information, and the reprojection information of the object to be detected;
determining pose information including the pose, the spatial coordinate information, and the reprojection information.
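A minimal PyTorch sketch of the pose recognition stage of claim 7: a backbone extracts features from the second image, and separate heads output the pose, the spatial coordinate information, and the reprojection information. Layer sizes and output dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class PoseRecognitionHead(nn.Module):
    def __init__(self, in_ch: int, feat: int = 128):
        super().__init__()
        # Backbone network layer for feature extraction.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.pose = nn.Linear(feat, 7)    # e.g. quaternion + translation
        self.coords = nn.Linear(feat, 3)  # spatial coordinate information
        self.reproj = nn.Linear(feat, 2)  # reprojection information

    def forward(self, second_image: torch.Tensor) -> dict:
        f = self.backbone(second_image)   # feature extraction information
        return {"pose": self.pose(f),
                "coords": self.coords(f),
                "reproj": self.reproj(f)}
```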
8. The method of claim 7, wherein the pose recognition model comprises an object extraction module for determining the second image and a pose recognition module for determining the pose information, the pose recognition module comprising a backbone network layer for feature extraction and a depth denoising layer for pose identification and internal noise reduction;
the training process of the pose recognition module comprises the following steps:
determining at least one sample first image and corresponding annotated reprojection information;
taking the sample first image as the input of the pose recognition module, and determining corresponding predicted pose information, wherein the predicted pose information comprises a predicted pose, predicted coordinates, and predicted reprojection information;
and adjusting the depth denoising layer in the pose recognition module according to the difference between the annotated reprojection information and the predicted reprojection information of the at least one sample first image.
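The training of claim 8 supervises only the reprojection output. One hypothetical training step follows, assuming an L1 difference (the claim does not name a loss) and an optimizer constructed over the depth denoising layer's parameters alone, e.g. `torch.optim.Adam(denoise_layer.parameters())`:

```python
import torch
import torch.nn.functional as F

def train_step(module, optimizer, sample_first_image, annotated_reproj):
    """Adjust the depth denoising layer from the difference between
    annotated and predicted reprojection information."""
    optimizer.zero_grad()
    pred = module(sample_first_image)            # predicted pose information
    loss = F.l1_loss(pred["reproj"], annotated_reproj)
    loss.backward()    # back-propagate the reprojection difference
    optimizer.step()   # updates only the depth denoising layer's parameters
    return loss.item()
```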
9. A pose estimation apparatus, characterized in that the apparatus comprises:
the image determining module is used for acquiring a target image comprising an object to be detected;
the region extraction module is used for inputting the target image into a pose recognition model, extracting a target region where an object to be detected is located in the target image through the pose recognition model, and obtaining a first image and a mask image corresponding to the first image;
the noise reduction module is used for carrying out external noise reduction on the first image according to the mask image to obtain a second image;
and the pose estimation module is used for carrying out internal noise reduction and pose identification on the second image to obtain pose information of the object to be detected.
10. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the memory-stored instructions to perform the method of any one of claims 1 to 8.
11. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any one of claims 1 to 8.
CN202210975426.3A 2022-08-15 2022-08-15 Pose estimation method and device, electronic equipment and storage medium Pending CN115471647A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210975426.3A CN115471647A (en) 2022-08-15 2022-08-15 Pose estimation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210975426.3A CN115471647A (en) 2022-08-15 2022-08-15 Pose estimation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115471647A (en) 2022-12-13

Family

ID=84368067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210975426.3A Pending CN115471647A (en) 2022-08-15 2022-08-15 Pose estimation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115471647A (en)

Similar Documents

Publication Publication Date Title
CN109740516B (en) User identification method and device, electronic equipment and storage medium
TWI759647B (en) Image processing method, electronic device, and computer-readable storage medium
CN111540000B (en) Scene depth and camera motion prediction method and device, electronic device and medium
CN111340048B (en) Image processing method and device, electronic equipment and storage medium
CN111783986A (en) Network training method and device and posture prediction method and device
CN111401230B (en) Gesture estimation method and device, electronic equipment and storage medium
CN110909203A (en) Video analysis method and device, electronic equipment and storage medium
CN113139471A (en) Target detection method and device, electronic equipment and storage medium
CN114332503A (en) Object re-identification method and device, electronic equipment and storage medium
CN113822798B (en) Method and device for training generation countermeasure network, electronic equipment and storage medium
CN111931781A (en) Image processing method and device, electronic equipment and storage medium
CN114581525A (en) Attitude determination method and apparatus, electronic device, and storage medium
CN113052874B (en) Target tracking method and device, electronic equipment and storage medium
CN111178115B (en) Training method and system for object recognition network
WO2023155350A1 (en) Crowd positioning method and apparatus, electronic device, and storage medium
CN114565962A (en) Face image processing method and device, electronic equipment and storage medium
CN114445753A (en) Face tracking recognition method and device, electronic equipment and storage medium
CN114842404A (en) Method and device for generating time sequence action nomination, electronic equipment and storage medium
CN115035440A (en) Method and device for generating time sequence action nomination, electronic equipment and storage medium
CN114387622A (en) Animal weight recognition method and device, electronic equipment and storage medium
CN114445778A (en) Counting method and device, electronic equipment and storage medium
CN114266305A (en) Object identification method and device, electronic equipment and storage medium
CN115471647A (en) Pose estimation method and device, electronic equipment and storage medium
CN113837933A (en) Network training and image generation method and device, electronic equipment and storage medium
CN112330721A (en) Three-dimensional coordinate recovery method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination