WO2022237688A1

WO2022237688A1 - Method and apparatus for pose estimation, computer device, and storage medium

Info

Publication number: WO2022237688A1
Application number: PCT/CN2022/091484
Authority: WO
Inventors: 贾配洋; 侯俊
Original assignee: 影石创新科技股份有限公司
Priority date: 2021-05-12
Filing date: 2022-05-07
Publication date: 2022-11-17
Also published as: CN113158974A

Abstract

The present application relates to a method and apparatus for pose estimation, a computer device, and a storage medium. The method comprises: acquiring a target image to undergo pose estimation, the target image comprising a target subject to be processed; performing feature extraction on the basis of the target image, and acquiring a first extracted feature; performing feature expansion on the first extracted feature by means of an image feature expansion network, and obtaining an expanded image feature; performing feature extraction on the expanded image feature, and obtaining a second extracted feature; performing feature compression on the second extracted feature by means of an image feature compression network, and obtaining a compressed image feature; determining, on the basis of the compressed image feature, key point location information corresponding to the target subject in the target image, and performing pose estimation on the target subject on the basis of the key point location information. The present method can improve the efficiency of pose estimation.

Description

Attitude estimation method, device, computer equipment and storage medium

technical field

The present application relates to the technical field of computer vision, in particular to a pose estimation method, device, computer equipment and storage medium.

Background technique

With the development of computer vision technology, attitude estimation, as one of the important applications in computer vision, has also developed rapidly and is widely used in object activity analysis, video surveillance or object interaction and other fields. For example, human body pose estimation in pose estimation, through human body pose estimation, can detect various key points of a human body in an image containing a human body. For example, the facial features, limbs or joints of the human body can be obtained through human body pose estimation. Because of its functions, it is widely used in scenes such as stop-motion animation, collage dance, transparent people, walking stitching or action classification.

technical problem

However, current pose estimation methods suffer from low efficiency.

technical solution

Based on this, it is necessary to provide an attitude estimation method, device, computer equipment and storage medium capable of improving the efficiency of attitude estimation in order to address the above technical problems.

A pose estimation method, the method comprising: acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed; performing feature extraction based on the target image to obtain a first extracted feature; The expansion network performs feature expansion on the first extraction feature to obtain an expanded image feature; performs feature extraction on the expanded image feature to obtain a second extraction feature; performs feature compression on the second extraction feature through an image feature compression network, Obtaining compressed image features; determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.

In one of the embodiments, the image feature expansion network includes a plurality of feature convolution channels, and performing feature expansion on the first extracted feature through the image feature expansion network, and obtaining the expanded image feature includes: The extracted features are respectively input into a plurality of feature convolution channels corresponding to the image feature expansion network, and each of the feature convolution channels uses a feature dimension preserving convolution kernel to convolve the first extracted feature to obtain each of the feature The convolution features output by the convolution channels; the expanded image features are obtained by combining the convolution features output by each of the feature convolution channels.

In one of the embodiments, the determining the key point position information corresponding to the target object in the target image based on the compressed image features includes: amplifying the compressed image features to obtain enlarged image features; The enlarged image features are convolved to obtain third extracted features; based on the third extracted features, key point position information corresponding to the target object in the target image is determined.

In one of the embodiments, the acquiring the target image to be subjected to pose estimation includes: acquiring an initial image; performing object detection on the initial image to obtain probabilities that each of the multiple candidate image regions in the initial image includes the target object; Based on the probability that the candidate image area includes the target object, select from the candidate image area to obtain an object image area including the target object; according to the object image area, extract an intercepted image area from the initial image, and extract the intercepted image as the target image for pose estimation.

In one of the embodiments, the extraction of the intercepted image area from the initial image according to the object image area, and using the intercepted image as the target image for pose estimation includes: obtaining the image area in the object image area the center coordinates of the object image area; obtain the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient; The extension direction is extended to obtain the extension coordinates; the image area located within the extension coordinates is used as the image interception area, and the intercepted image is used as the target image to be subjected to pose estimation.

In one embodiment, there are multiple key point position information, and the method further includes: according to the mapping relationship between the key point position information and the target point position information, converting each key point position information into a corresponding The target point position information; The target point position information is the position information of the key point position information in the initial image; Based on each of the target position information, perform pose estimation on the target object to obtain the target image corresponding target pose.

A method for generating a target video, the method further comprising: acquiring a target action, determining a gesture sequence corresponding to the target action, and performing the gestures in the gesture sequence in order to obtain the target action; performing the above gesture estimation The method is to obtain the target pose corresponding to each target image in the target image set; to obtain the image corresponding to each target pose in the pose sequence from the target image set as a video frame image; according to the sorting of the poses in the pose sequence The obtained video frame images are arranged to obtain the target video corresponding to the target action.

A pose estimation device, the device comprising: a target image acquisition module, for acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed; a first feature extraction module, for based on the The target image is feature extracted to obtain the first extracted feature; the expanded image feature obtaining module is used to expand the first extracted feature through the image feature expansion network to obtain the expanded image feature; the second extracted feature obtained module is used for Feature extraction is performed on the expanded image features to obtain second extracted features; a compressed image feature obtaining module is used to perform feature compression on the second extracted features through an image feature compression network to obtain compressed image features; key point position information is determined A module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.

In one of the embodiments, the expanded image feature obtaining module is used to input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, each of the feature convolution channels utilizes a feature dimension Keeping the convolution kernel to convolve the first extracted features to obtain the convolution features output by each of the feature convolution channels; combining the output convolution features of each of the feature convolution channels to obtain the expanded image features.

In one of the embodiments, the key point location information determination module is used to amplify the compressed image features to obtain enlarged image features; perform convolution on the enlarged image features to obtain a third extracted feature; based on the The third feature extraction is to determine key point position information corresponding to the target object in the target image.

In one of the embodiments, the target image acquisition module is used to acquire an initial image; object detection is performed on the initial image to obtain the probability that a plurality of candidate image areas in the initial image respectively include the target object; based on the candidate image areas The probability of including the target object is selected from the candidate image area to obtain the object image area including the target object; according to the object image area, the intercepted image area is extracted from the initial image, and the intercepted image is used as the image to be estimated. target image.

In one of the embodiments, the target image acquisition module is used to acquire the center coordinates in the object image area; acquire the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient; Based on the central coordinates and the area extension value, extend to the extension direction corresponding to the area extension value to obtain extension coordinates; use the image area within the extension coordinates as the intercepted image area, and use the intercepted image as The target image to be pose estimated.

In one of the embodiments, the target image acquisition module is used to convert each of the key point position information into corresponding target point position information according to the mapping relationship between the key point position information and the target point position information; the target point The position information is the position information of the key point position information in the initial image; based on each of the target position information, pose estimation is performed on the target object to obtain a target pose corresponding to the target image.

A target video generation device, the device is used to acquire target actions, determine the gesture sequence corresponding to the target action, the gestures in the gesture sequence are executed in order to obtain the target action; acquire each target image set The target pose corresponding to the target image; obtaining the image corresponding to each target pose in the pose sequence from the target image set as a video frame image; pairing the obtained video frame images according to the sorting of the poses in the pose sequence Arranging to obtain the target video corresponding to the target action.

A computer device, comprising a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program: acquiring a target image to be subjected to pose estimation; the target image includes a target image to be processed The target object; Based on the target image, feature extraction is performed to obtain the first extraction feature; the image feature expansion network is used to perform feature expansion on the image extraction feature to obtain the expanded image feature; the feature extraction is performed on the expanded image feature to obtain The second extraction feature; performing feature compression on the second extraction feature through an image feature compression network to obtain a compressed image feature; determining key point position information corresponding to the target object in the target image based on the compressed image feature, Estimating the pose of the target object based on the position information of the key points.

A computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented: acquiring a target image to be subjected to attitude estimation; the target image includes a target object to be processed; based on The target image is subjected to feature extraction to obtain the first extracted feature;

Feature expansion is performed on the image extraction feature by an image feature expansion network to obtain an expanded image feature; feature extraction is performed on the expanded image feature to obtain a second extraction feature; and the second extraction feature is obtained by an image feature compression network. compressing to obtain compressed image features; determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.

technical effect

The above attitude estimation method, device, computer equipment and storage medium obtain the target image to be subjected to attitude estimation; the target image includes the target object to be processed; perform feature extraction based on the target image to obtain the first extracted feature; The expansion network performs feature expansion on the first extraction feature to obtain the expanded image feature; performs feature extraction on the expanded image feature to obtain the second extraction feature; performs feature compression on the second extraction feature through the image feature compression network to obtain the compressed image feature; Compress image features to determine the key point position information corresponding to the target object in the target image, and perform pose estimation on the target object based on the key point position information. First, the image feature expansion network is used to expand the extracted first extraction features, so that as many image features as possible can be input at the input end of the pose estimation network, and then the second extraction features are characterized by the image feature compression network. Compression, combined with the above features, can achieve the purpose of improving the efficiency and accuracy of pose estimation.

Description of drawings

Fig. 1 is an application environment diagram of a posture estimation method in an embodiment;

Fig. 2 is a schematic flow chart of a pose estimation method in an embodiment;

Fig. 3 is a schematic flow chart of a pose estimation method in another embodiment;

FIG. 4 is a schematic flow chart of a pose estimation method in another embodiment;

FIG. 5 is a schematic flow chart of a pose estimation method in another embodiment;

Fig. 6 is a schematic flow chart of a pose estimation method in another embodiment;

FIG. 7 is a schematic flow chart of a pose estimation method in another embodiment;

Fig. 8 is a schematic flow chart of a method for generating a target video in an embodiment;

Fig. 9 is a schematic diagram of a panoramic image including objects in one embodiment;

Fig. 10 is a schematic diagram of detecting an object in an embodiment;

Fig. 11 is a schematic diagram of intercepting object subgraphs in an embodiment;

Fig. 12 is a schematic diagram of object key points in an embodiment;

Fig. 13 is a schematic diagram of a human body posture model in an embodiment;

Fig. 14 is a structural block diagram of a pose estimation device in an embodiment;

Figure 15 is a diagram of the internal structure of a computer device in one embodiment.

Embodiments of the present invention

In order to make the purpose, technical solution and advantages of the present application clearer, the present application will be further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application, and are not intended to limit the present application.

The attitude estimation method provided in this application can be applied to the application environment shown in FIG. 1 , and specifically applied to an attitude estimation system. The pose estimation system includes an image acquisition device 102 and a terminal 104 , wherein the image acquisition device 102 and the terminal 104 are connected in communication. The terminal 104 executes a pose estimation method. Specifically, the terminal 104 acquires a target image to be subjected to pose estimation transmitted from the image acquisition device 102; the target image includes a target object to be processed; the terminal 104 performs feature extraction based on the target image , to obtain the first extraction feature; expand the first extraction feature through the image feature expansion network to obtain the expanded image feature; perform feature extraction on the above-mentioned expanded image feature to obtain the second extraction feature; use the image feature compression network to obtain the second extraction feature The feature is compressed to obtain the compressed image feature; the key point position information corresponding to the target object in the target image is determined based on the compressed image feature, and the pose estimation of the target object is performed based on the key point position information. Wherein, the image acquisition device 102 may be, but not limited to, various devices with an image acquisition function, and may be distributed outside the terminal 104 or inside the terminal 104 . For example: various cameras, scanners, various cameras, and image acquisition cards distributed outside the terminal 104 . The terminal 104 can be, but is not limited to, various cameras, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. It can be understood that the method provided in the embodiment of the present application may also be executed by a server.

In one embodiment, as shown in FIG. 2 , a pose estimation method is provided. The method is applied to the terminal in FIG. 1 as an example for illustration, including the following steps:

Step 202, acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed.

Among them, pose estimation refers to the process of estimating the pose of the target object by detecting key points in the target object and describing one or more key points. Key points refer to feature points that can describe the structural features of the target object. For example, the facial features, leg joints, or hand joints of the target object. The target object refers to the object for pose estimation. For example, human body or animal etc.

Specifically, the terminal can directly or indirectly obtain the target image to be subjected to pose estimation.

In an embodiment, the terminal takes the received image including the target object to be processed and transmitted from the image acquisition device as the target image.

In one embodiment, the terminal preprocesses the received image transmitted from the image acquisition device, and uses the preprocessed image as the target image.

In one embodiment, the image collection device is a panoramic camera. After the panoramic camera collects the panoramic image, the panoramic image is used as a target image to be subjected to pose estimation, and the target image includes a target object to be processed. The target object can be complete, only partially contained or occluded.

In one embodiment, the terminal acquires a panoramic image by extracting frames from the panoramic video, and directly or after preprocessing the panoramic image, obtains a target image to be subjected to pose estimation. The preprocessing includes normalizing the panoramic image or cropping the target object in the panoramic image.

Step 204, perform feature extraction based on the target image, and obtain first extracted features.

Wherein, a feature is information representing a specific attribute of a target image, through which an object in the target image can be identified or the target image can be classified.

Specifically, feature extraction may be performed on the target image through a feature extraction network to obtain the first extracted feature.

In one embodiment, the feature extraction of the target image may be performed through a lightweight deep neural network to obtain the first extracted feature.

Step 206, performing feature expansion on the first extracted feature through the image feature expansion network to obtain expanded image features.

Among them, the image feature expansion network refers to a network that can increase the number of image features. The expanded image feature refers to the image feature after the image feature is expanded.

Specifically, the acquired channels of the first extracted features are expanded by point-by-point convolution, the number of features is enriched, and the expanded image features are obtained.

In one embodiment, after the terminal obtains the first extracted features, the image feature expansion network uses 1*1 point-by-point convolution to perform feature expansion on the image extracted features to obtain expanded image features.

Step 208, perform feature extraction on the features of the expanded image to obtain second extracted features.

Specifically, after the expanded image features are obtained, feature extraction may be performed on the expanded image features through convolution with fewer parameters to obtain the second extracted features.

In one embodiment, the expanded image features may be down-sampled by preset convolution and activation functions, and feature extraction may be performed on the expanded image features to obtain the second extracted features. For example, the preset convolution can be 3*3 convolution and a ReLU (Rectified Linear Unit, rectified linear unit) activation function to perform feature extraction on the expanded image feature to obtain the second extracted feature. Based on the adaptability of each application scenario, the above activation function can use Sigmoid function (Sigmoid function, S-type growth curve), ELU (Exponential Linear Unit, Exponential Linear Unit), GELU (Gaussian Error Linear Unit, Gaussian error linear unit) and other replacements.

Step 210, perform feature compression on the second extracted features through the image feature compression network to obtain compressed image features.

Among them, the image feature compression network refers to a network that can reduce the number of image features. Compressed image features refer to image features after image features are compressed.

Specifically, after obtaining the second extracted features, the terminal may perform feature compression on the second extracted features, so as to increase the speed of the terminal's pose estimation.

In one embodiment, after the terminal obtains the second extracted features, the image feature compression network uses 1*1 point-by-point convolution to perform feature compression on the second extracted features, and obtains compressed image features after linear transformation.

Step 212 , determine key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.

Wherein, the key point position information refers to information capable of determining the position of the key point in the target image. For example, information such as the coordinates, names or directions of the key points in the target image.

Specifically, after obtaining the compressed image feature, the terminal may obtain the key point position information corresponding to the target object through the correspondence between the compressed image feature and the key point position information corresponding to the target object.

In one embodiment, the terminal stores a matching relationship table between compressed image features and key point location information. After obtaining the compressed image features, the terminal obtains corresponding key point location information by traversing the above matching relationship table. According to the position coordinates and names in the key point position information, the pose estimation of the target object is performed. For example, the position information of the key point obtained by the terminal after traversing the above matching relationship table is (200, 200, wrist joint), indicating that the position coordinates of the key point are (200, 200), and the key point is the wrist joint. Describe the position information of multiple key points above, and estimate the pose.

In the above-mentioned attitude estimation method, by obtaining the target image to be subjected to attitude estimation; the target image includes the target object to be processed; feature extraction is performed based on the target image to obtain the first extraction feature; the first extraction feature is obtained through the image feature expansion network Perform feature expansion to obtain the expanded image features; perform feature extraction on the expanded image features to obtain the second extracted features; perform feature compression on the second extracted features through the image feature compression network to obtain compressed image features; determine the target image based on the compressed image features The target object corresponds to the key point position information, and the pose estimation of the target object is performed based on the key point position information. Firstly, the feature expansion of the extracted first extraction features is performed through the image feature expansion network, so that as many image features as possible can be input at the input end of the pose estimation network, and then the second extraction features are characterized by the image feature compression network. Compression, combined with the above features, can achieve the purpose of improving the efficiency and accuracy of pose estimation.

In one embodiment, as shown in FIG. 3 , the image feature expansion network includes a plurality of feature convolution channels, and the first extracted feature is subjected to feature expansion through the image feature expansion network, and the expanded image features obtained include:

Step 302, input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, and each feature convolution channel uses a feature dimension-preserving convolution kernel to convolve the first extracted features to obtain each feature convolution The convolutional features output by the channel.

Among them, the feature dimension preserving convolution kernel refers to the convolution kernel that can keep the dimension of the image unchanged, and the dimension of the image refers to the number of channels of the image. For example, a convolution kernel of size 1*1.

Specifically, the terminal can maintain the convolution kernel by setting the feature dimension, and perform convolution on the first extracted feature from each feature convolution channel, which can be obtained by using fewer parameters while keeping the scale of the first extracted feature unchanged. The convolutional features output by the feature convolution channel.

In one embodiment, the terminal performs convolution on the image extraction features of each feature convolution channel by setting the feature dimension to maintain the number and size of the convolution kernels, and obtains the convolution features output by the feature convolution channel. For example, on a 3*3 convolutional network with a network of 64 channels, after adding a convolution kernel with a size of 1*1 and a channel number of 256, the original network can be converted to 64*256 parameters. The number of channels has been expanded from 64 to 256.

In step 304, the convolution features output by each feature convolution channel are synthesized to obtain expanded image features.

Specifically, after obtaining the convolution feature output by the feature convolution channel, the feature dimension-preserving convolution kernel can linearly combine each pixel in the first extracted feature in the image on different channels to obtain the expanded image feature . For example, the composition of the expanded network is to add a 1*1, 28-channel convolution kernel behind the 3*3, 64-channel convolution kernel, and convert it into a 3*3, 28-channel convolution kernel. The original 64 The channels are linearly combined across channels to become 28 channels, which realizes the information interaction between channels, and obtains the expanded image features through the convolution features output by each feature convolution channel.

In this embodiment, in the image feature expansion network, by using the feature dimension preserving convolution kernel to convolve the first extracted feature in the feature convolution channel, the convolution feature output by the feature convolution channel is obtained, and each feature convolution The convolution feature output by the product channel is used to obtain the expanded image feature, which can achieve the purpose of obtaining the expanded image feature with fewer parameters, thereby improving the efficiency of pose estimation.

In one embodiment, as shown in FIG. 4, determining the key point position information corresponding to the target object in the target image based on the compressed image features includes:

Step 402, amplifying the compressed image features to obtain enlarged image features.

Specifically, after the compressed image features are obtained, the enlarged image features are obtained by upsampling the features.

In one embodiment, the terminal amplifies the compressed image features, and sets the number of input and output channels of the three-layer sampling network to (256, 128), (128, 64), (64, 64 ), which can reduce the amount of network parameters and computation.

In one embodiment, the terminal performs interpolation calculation on the compressed image features through an interpolation method to obtain enlarged image features. For example, on the basis of compressed image features, new elements are inserted between pixels using appropriate interpolation algorithms such as linear interpolation or bilinear interpolation.

Step 404, performing convolution on the enlarged image features to obtain a third extracted feature.

Specifically, after the enlarged image features are obtained, in order to compensate for the reduction of nonlinear units during the process of enlarging the compressed image features, the enlarged image features are convolved to obtain the third extracted features.

Step 406, based on the third extracted feature, determine key point position information corresponding to the target object in the target image.

Specifically, after the third extracted features are obtained, searching and filtering are performed on the third extracted features to determine key point position information corresponding to the target object in the target image.

In one embodiment, the terminal stores a matching list of image features and key point position information. After obtaining the third extracted feature, the matching list is traversed to obtain the key point position information corresponding to the third extracted feature, that is, the key point position information in the target image The key point position information corresponding to the target object.

In this embodiment, by amplifying the compressed image features, the enlarged image features are obtained, and the enlarged image features are convolved to obtain the third extraction feature, based on the third extraction feature, the key corresponding to the target object in the target image is determined The point position information can achieve the purpose of improving the image output quality, so as to achieve the purpose of obtaining the key point position information corresponding to the target object more accurately.

In one embodiment, as shown in FIG. 5 , acquiring a target image to be subjected to pose estimation includes:

Step 502, acquiring an initial image.

Among them, the initial image refers to the unprocessed original image. The original image is an image obtained directly by an image acquisition device or a terminal.

In one embodiment, the terminal can collect the initial image through the connected image acquisition device, and the acquisition device transmits the acquired initial image to the terminal in real time; or the acquisition device temporarily stores the acquired initial image locally in the acquisition device, when When receiving an image acquisition instruction from the terminal, the locally stored initial image is transmitted to the terminal, and accordingly, the terminal can acquire the initial image.

In one embodiment, the terminal collects the initial image through an internal image collection module, stores the collected image in the terminal memory, and obtains the initial image from the memory when the terminal needs to obtain the initial image.

Step 504, perform object detection on the initial image, and obtain probabilities that each of the multiple candidate image regions in the initial image includes the target object.

Specifically, after the initial image is acquired, the initial image is divided into multiple image sub-regions as candidate image regions, and the probability of the target object in each candidate image region is detected. For example, if the image is divided into sub-region A, sub-region B and sub-region C, the probability of the target object in sub-region A is 0%, the probability of the target object in sub-region B is 10%, and the probability of the target object in sub-region C is The probability is 90%.

In an embodiment, the terminal obtains the probability that each image sub-region includes the target object by gradually reducing the size of the image sub-regions.

Step 506 , based on the probability that the candidate image region includes the target object, an object image region including the target object is selected from the candidate image regions.

Specifically, after obtaining the probabilities that the plurality of candidate image regions in the initial image respectively include the target object, the terminal may compare the probabilities of each candidate image region to obtain candidate image regions whose probabilities are within a preset probability threshold range, and use the candidate The image area serves as an object image area including the target object.

In one embodiment, the terminal traverses the probability that the candidate image areas include the target object, obtains the maximum probability value among the probabilities, and uses the candidate image area corresponding to the maximum probability value as the target image area of the target object.

Step 508 , according to the target image area, extract the intercepted image area from the initial image, and use the intercepted image as the target image to be subjected to pose estimation.

Specifically, after the terminal obtains the object image area including the target object, based on the position information of the object image area, the image of the obtained image area can be intercepted as the target image to be subjected to pose estimation, so as to reduce the amount of computation during pose estimation, Improve pose estimation efficiency.

In one embodiment, the terminal can extract the coordinate information of the object image area, and use the coordinate information to intercept and obtain the target image for pose estimation.

In one embodiment, by receiving a user's target object frame selection operation, the frame-selected image area may be used as the target image area, and the frame-selected image area may be intercepted from the initial image as the target image to be pose estimated.

In this embodiment, by acquiring the initial image and performing object detection on the initial image, the probabilities that the multiple candidate image areas in the initial image respectively include the target object are obtained, and based on the probability that the candidate image areas include the target object, the candidate image areas including the target object are selected to obtain The object image area of the target object is extracted from the initial image to obtain the object image area, and the intercepted image is used as the target image for pose estimation, so as to achieve the purpose of accurately obtaining the target image from the initial image.

In one embodiment, as shown in FIG. 6, according to the object image area, the intercepted image area is extracted from the initial image, and the intercepted image is used as the target image to be subjected to pose estimation including:

Step 602, acquiring the center coordinates in the object image area.

Wherein, the center coordinates refer to the coordinates of the pixel at the center point of the object image area. The coordinates are based on the coordinates of the pixel at the center of the object image area of the target object in the initial image, and can be determined according to the length and width of the initial image.

Specifically, after the target image area is determined, the terminal obtains the pixel point at the center of the target image area, and obtains the coordinates of the pixel point through the pixel point coordinate obtaining tool.

In step 604, the area size corresponding to the target image area is obtained, and the area extension value is obtained based on the area size and the size expansion coefficient.

Here, the area size refers to the area length and area width of the target image area. For example, if the area length of the target image area is h, and the area width of the target image area is w, then the area size is w*h, and the size expansion coefficient refers to a coefficient that can increase the area size. The area extension value refers to the growth value of the area size obtained by correcting the area size with the size expansion coefficient.

Specifically, after acquiring the center coordinates in the target image area, the terminal may then acquire the area size corresponding to the target image area, and obtain the area extension value through the functional relationship between the area size and the size expansion coefficient.

In one embodiment, the terminal obtains the area size corresponding to the object image area through the image size measurement tool, and obtains the area extension value by using the product relationship between the area size and the size expansion coefficient. For example, the area width in the area size is w, the size expansion coefficient is exp_ratio, and the area extension value of the area width is w*exp_ratio*1.2/2; similarly, the area extension value of the area length in the area size can also be passed The corresponding dimensional expansion coefficients are obtained.

Step 606, based on the center coordinates and the area extension value, extend to the extension direction corresponding to the area extension value to obtain the extension coordinates.

Wherein, the extension direction refers to the direction in which the width and length increase corresponding to the area extension value. The extension coordinates refer to the coordinates of the object image area obtained by extending the object image area in the extension direction with the center coordinates as a reference point. The coordinates may be represented by the coordinates of the upper left corner and the lower right corner of the object image area.

Specifically, after obtaining the area extension value, the terminal may use the center coordinates as a reference point to expand the target image area by using the area extension value to obtain the extension coordinates corresponding to the target image area. In order to enable the object image area to include a more complete target object.

In one embodiment, the center coordinates are expressed as (x, y), and the extension coordinates are (x0, y0) and (x1, y1), where x0 and x1 are the coordinates of the extension value of the object image area in the direction of the width extension of the image , y0 and y1 are the coordinates of the object image area extension value in the length extension direction of the image, then the extension coordinates can be expressed as the formula:

x0 = int(x - w * exp_ratio * 1.2 / 2)

x1 = int(x + w * exp_ratio * 1.2 / 2)

y0 = int(y - h * exp_ratio * 0.8 / 2)

y1 = int(y + h * exp_ratio * 0.8 / 2)

In one embodiment, when the area extension value in the width extension direction in the area extension value is less than or equal to 0, the extension value is zero; when the area extension value is greater than or equal to the width of the initial image, the width of the initial image is used as the area extension value . Similarly, when the area extension value in the length extension direction in the area extension value is less than or equal to zero, the area extension value is zero; when the area extension value is greater than or equal to the height of the initial image, the height of the initial image is used as the area extension value.

Step 608 , taking the image area within the extended coordinates as the intercepted image area, and using the intercepted image as the target image to be subjected to pose estimation.

Specifically, after acquiring the extended coordinates, the terminal may use the image area within the extended coordinates as the intercepted image area, and use the intercepted image as the target image for pose estimation.

In this embodiment, by obtaining the center coordinates in the object image area and the area size corresponding to the object image area, the area extension value is obtained based on the area size and the size expansion coefficient, and the area extension value corresponding to the area extension value is obtained based on the center coordinates and the area extension value. The extension direction is extended to obtain the extension coordinates, and the image area located in the extension coordinates is used as the intercepted image area, and the intercepted image is used as the target image to be estimated for attitude, which can achieve the purpose of accurately intercepting the target image, and then improve the performance of attitude estimation. efficiency.

In one embodiment, as shown in Figure 7, there are multiple key point position information, and the method also includes:

Step 702: According to the mapping relationship between the key point position information and the target point position information, each key point position information is converted into corresponding target point position information; the target point position information is the position information of the key point position information in the initial image.

Wherein, the location information refers to information related to the location that can reflect a certain location point. The position point may be a key point, or other points having the same structure or function as the key point. The position-related information may be coordinate information of a key point or information describing the position of the key point. For example, keypoint location information can be expressed as (100,100, eyes).

Specifically, there is a one-to-one correspondence between the key point position information in the target image and the target point position information in the initial image, and they can be converted to each other. After knowing the position information of the key point, the position information of the key point in the initial image can be correspondingly obtained, so that the position information in the initial image obtained in the target image can be accurately reflected in the initial image.

In one embodiment, the key point position information of the jth key point is expressed as (x_keypoints_j, y_keypoints_j), and the vertex coordinates of the i-th target image in the upper left corner of the initial image are expressed as (x_person_i, y_person_i), the key point is in The coordinates in the initial image are expressed as (x_original_keypoints, y_original_keypoints), then the coordinates of the key point in the initial image are expressed as the formula:

x_original_keypoints = x_person_i + x_keypoints_j

y_original_keypoints = y_person_i + y_keypoints_j

Step 704, perform pose estimation on the target object based on each target position information, and obtain the target pose corresponding to the target image.

Specifically, after determining each target position information corresponding to a plurality of key points, the terminal performs pose estimation through the corresponding relationship between the specific type of the key point and the target position information, and obtains the target pose corresponding to the target image.

In this embodiment, through the mapping relationship between the key point position information and the target point position information, the position information of each key point is converted into the corresponding target point position information, and the pose estimation of the target object is performed through each target position information to obtain the corresponding position of the target image. target attitude. The purpose of obtaining the target pose in the target image can be achieved.

In one embodiment, as shown in Figure 8, the target video generation method includes:

Step 802, acquire the target action, and determine the gesture sequence corresponding to the target action, wherein the gestures in the gesture sequence are executed in order to obtain the target action.

Wherein, the target action refers to an action obtained after each gesture is executed in sequence. Gesture refers to the individual sub-actions that make up an action. For example, the target action is an arm stretching movement, and the target action is composed of multiple sub-actions such as placing the arms flat, straightening the arms, and turning the arms together sideways. Multiple gestures can form a gesture sequence according to the order of the front and back, and when the sequence of gestures is executed sequentially, the target action is obtained.

Step 804, according to the steps in the above method embodiments, acquire the target pose corresponding to each target image in the target image set.

Specifically, after acquiring the target action, the terminal can obtain various poses based on the above-mentioned pose estimation methods, and each pose constituting the target action is in a different target image, and the corresponding poses can be obtained from each target image. For example, the target pose F exists in the target image E, or the target pose H exists in the target image G, and so on.

In one embodiment, according to the corresponding relationship between the pose and the image, the target pose corresponding to each target image is obtained in the target image set.

Step 806, acquire images corresponding to each target pose in the pose sequence from the target image set as video frame images.

Specifically, after acquiring the target pose, the terminal obtains images corresponding to the target pose according to the target pose, and uses each obtained image as a video frame image.

In one embodiment, the time stamps corresponding to the images corresponding to the target poses are obtained, and the images carrying the respective time stamps are used as video frame images.

Step 808, arrange the obtained video frame images according to the pose sequence in the pose sequence, and obtain the target video corresponding to the target action.

Specifically, there is a one-to-one correspondence between the poses in the pose sequence and the corresponding video frame images, and the corresponding video frame images are arranged according to the pose sequence in the pose sequence to obtain the target video corresponding to the target action.

In one embodiment, there is a binding correspondence between the ordering of gestures in the gesture sequence and the time stamps of the video frame images, and the gestures in the gesture sequence are sorted, that is, after obtaining the video frame images, according to the The time stamps of the frame images arrange the video frame images, and the target video is obtained according to the video frame images.

In this embodiment, by acquiring the target action and the target pose corresponding to each target image in the target image set, the image corresponding to each target pose in the pose sequence is obtained from the target image set as a video frame image, according to the pose in the pose sequence Sorting Arranges the obtained video frame images to obtain the target video corresponding to the target action, which can achieve the purpose of obtaining the target video corresponding to the target action through pose estimation, so that the pose estimation can be realized and obtained in practical applications.

In one embodiment, the terminal is a panoramic camera and the target object is a human body as an example. As shown in FIG. A human target object for human pose estimation needs to be included. The human target object can be complete, or only contain a part of it or have occlusions. As shown in Figure 10, after the panoramic image is normalized, the coordinate value of the human body bounding box B1 is obtained through the human body tracking or detection algorithm, and the coordinate value of the human body bounding box or the expansion of the human body bounding box is obtained. Coordinate value of bounding box B2 after expansion. As shown in Figure 11, the panoramic image is cropped to obtain a sub-panoramic image with a border of B2. After normalizing the sub-panoramic image, the sub-panoramic image is input into the trained human pose estimation model, as shown in Figure 12 As shown, the heat map of the first key point C is obtained, and the sub-panoramic image is input into the trained human pose estimation model to obtain the heat map of the second key point, and so on, by obtaining multiple heat maps map, and in turn get a preset number of heatmaps with keypoints. The coordinates of the key points in the heat map are mapped to the sub-panoramic image, and the position of the key point in the sub-panoramic image is mapped to the panoramic image, so as to obtain the position of the key point in the panoramic image, thereby estimating the posture of the human body.

In one embodiment, when the terminal performs normalization processing on the panoramic image or the normalization processing on the sub-panoramic image, the pixel value of the pixel point in the normalized image and the pixel value of the pixel point in the original image The proportional relationship between the difference with the average value of the pixel value, and obtain the pixel value of the pixel point in the normalized image. Suppose, the pixel value of a certain pixel in the normalized image is expressed as X_normalization, the pixel value of a certain pixel in the panoramic image or sub-panoramic image is expressed as X, and the pixels of all the pixels in the panoramic image or sub-panoramic image The average value of the value is expressed as mean, the proportional coefficient is expressed as std, and X_normalization is expressed as a formula:

X_normalization = (X-mean)/std

It can be understood that std can be the variance of all pixels in the panoramic image or sub-panoramic image; a certain pixel in the panoramic image or sub-panoramic image can be a pixel of RGB (red, green and blue) three channels.

In an embodiment, the terminal may obtain the coordinate value of the bounding box of the human body through a human body detection algorithm. For example, using Faster RCNN (Faster Region-CNN), YOLO (You Only Look Once) series algorithms, SSD (Single Shot MultiBox Detector) series algorithms, etc. or tracking algorithms such as Siamese (Siamese network ) tracking algorithm, etc.

In one embodiment, as shown in Figure 13, the human pose estimation model can be reduced by reducing the number of image feature blocks between stages in HRNet, for example, changing the number of down-sampled image feature blocks in the second stage to 1 , so that the human pose estimation model can reduce the amount of parameters and calculations, thereby improving the efficiency of human pose estimation.

It should be understood that although the various steps in the flow charts of FIGS. 2-8 are displayed sequentially as indicated by the arrows, these steps are not necessarily executed sequentially in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order restriction on the execution of these steps, and these steps can be executed in other orders. Moreover, at least some of the steps in Figures 2-8 may include multiple steps or stages, these steps or stages are not necessarily executed at the same moment, but may be executed at different moments, the execution of these steps or stages The sequence is not necessarily performed sequentially, but may be performed alternately or alternately with other steps or at least a part of steps or stages in other steps.

In one embodiment, as shown in FIG. 14 , a pose estimation device 1400 is provided, including: a target image acquisition module 1402, a first feature extraction module 1404, an expanded image feature acquisition module 1406, and a second extraction feature acquisition module 1408 , compressed image feature obtaining module 1410 and key point position information determining module 1412, wherein: the target image obtaining module 1402 is used to obtain the target image to be subjected to attitude estimation; the target image includes the target object to be processed; the first feature extraction module 1404, for feature extraction based on the target image, and obtain the first extracted feature; the expanded image feature obtaining module 1406, for performing feature expansion on the first extracted feature through the image feature expansion network, to obtain the expanded image feature; the second extracted feature is obtained Module 1408, for feature extraction of expanded image features, to obtain second extracted features; compressed image feature obtaining module 1410, for feature compression of second extracted features through image feature compression network, to obtain compressed image features; key point position The information determination module 1412 is configured to determine key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.

In one embodiment, the expanded image feature obtaining module 1406 is used to input the first extracted features into multiple feature convolution channels corresponding to the image feature expansion network, and each feature convolution channel uses the feature dimension to keep the convolution kernel to the first Extract features for convolution to obtain the convolution features output by each feature convolution channel; synthesize the convolution features output by each feature convolution channel to obtain expanded image features.

In one embodiment, the key point position information determination module 1412 is used to amplify the compressed image features to obtain enlarged image features; perform convolution on the enlarged image features to obtain the third extracted features; based on the third extracted features, determine The key point position information corresponding to the target object in the target image.

In one embodiment, the target image acquisition module 1402 is used to acquire an initial image; perform object detection on the initial image to obtain the probability that a plurality of candidate image areas in the initial image respectively include the target object; based on the probability that the candidate image areas include the target object from An object image area including the target object is selected from the candidate image area; according to the object image area, an intercepted image area is extracted from the initial image, and the intercepted image is used as the target image to be pose estimated.

In one embodiment, the target image acquisition module 1402 is used to acquire the center coordinates in the object image area; acquire the area size corresponding to the object image area, and obtain the area extension value based on the area size and the size expansion coefficient; based on the center coordinates and the area extension The value is extended to the extension direction corresponding to the area extension value to obtain the extension coordinates; the image area located within the extension coordinates is used as the intercepted image area, and the intercepted image is used as the target image for pose estimation.

In one embodiment, the target image acquisition module 1402 is configured to convert each key point position information into corresponding target point position information according to the mapping relationship between the key point position information and the target point position information; the target point position information is the key point position The position information of the information in the initial image; the pose estimation of the target object is performed based on each target position information, and the target pose corresponding to the target image is obtained.

In one embodiment, the target video generation device is used to obtain target actions, determine the gesture sequence corresponding to the target action, and the gestures in the gesture sequence are executed in order to obtain the target action; acquire the target gesture corresponding to each target image in the target image set ; Obtain the image corresponding to each target pose in the pose sequence from the target image set as a video frame image; arrange the obtained video frame images according to the pose sequence in the pose sequence to obtain the target video corresponding to the target action.

For specific limitations on the attitude estimation device, please refer to the above-mentioned limitations on the attitude estimation method, which will not be repeated here. Each module in the above-mentioned attitude estimation device can be implemented in whole or in part by software, hardware and a combination thereof. The above-mentioned modules can be embedded in or independent of the processor in the computer device in the form of hardware, and can also be stored in the memory of the computer device in the form of software, so that the processor can invoke and execute the corresponding operations of the above-mentioned modules.

In one embodiment, a computer device is provided. The computer device may be a terminal, and its internal structure may be as shown in FIG. 15 . The computer device includes a processor, a memory, a communication interface, a display screen and an input device connected through a system bus. Wherein, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used to communicate with an external terminal in a wired or wireless manner, and the wireless manner can be realized through WIFI, an operator network, NFC (Near Field Communication) or other technologies. When the computer program is executed by a processor, a pose estimation method is realized. The display screen of the computer device may be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer device may be a touch layer covered on the display screen, or a button, a trackball or a touch pad provided on the casing of the computer device , and can also be an external keyboard, touchpad, or mouse.

Those skilled in the art can understand that the structure shown in Figure 15 is only a block diagram of a partial structure related to the solution of this application, and does not constitute a limitation on the computer equipment on which the solution of this application is applied. The specific computer equipment can be More or fewer components than shown in the figures may be included, or some components may be combined, or have a different arrangement of components.

In one embodiment, there is also provided a computer device, including a memory and a processor, where a computer program is stored in the memory, and the processor implements the steps in the above method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps in the foregoing method embodiments are implemented.

Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented through computer programs to instruct related hardware, and the computer programs can be stored in a non-volatile computer-readable memory In the medium, when the computer program is executed, it may include the processes of the embodiments of the above-mentioned methods. Wherein, any references to memory, storage, database or other media used in the various embodiments provided in the present application may include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (Read-Only Memory, ROM), tape, floppy disk, flash memory or optical memory, etc. Volatile memory can include Random Access Memory (Random Access Memory, RAM) or external cache memory. By way of illustration and not limitation, RAM can come in many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (Dynamic Random Access Memory) Access Memory, DRAM), etc.

The technical features of the above embodiments can be combined arbitrarily. To make the description concise, all possible combinations of the technical features in the above embodiments are not described. However, as long as there is no contradiction in the combination of these technical features, they should be It is considered to be within the range described in this specification.

The above-mentioned embodiments only represent several implementation modes of the present application, and the description thereof is relatively specific and detailed, but it should not be construed as limiting the scope of the patent for the invention. It should be noted that those skilled in the art can make several modifications and improvements without departing from the concept of the present application, and these all belong to the protection scope of the present application. Therefore, the scope of protection of the patent application should be based on the appended claims.

Claims

A method for attitude estimation, characterized in that the method comprises:

Acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed;

performing feature extraction based on the target image to obtain a first extracted feature;

performing feature expansion on the first extracted feature through an image feature expansion network to obtain expanded image features;

performing feature extraction on the expanded image feature to obtain a second extracted feature;

performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features;

Determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
The method according to claim 1, wherein the image feature expansion network includes a plurality of feature convolution channels, and the image feature expansion network is used to perform feature expansion on the first extracted feature, and the expanded image features obtained include :

Inputting the first extracted features into a plurality of feature convolution channels corresponding to the image feature expansion network, each of the feature convolution channels uses a feature dimension preserving convolution kernel to convolve the first extracted features, Obtain the convolution features output by each of the feature convolution channels;

The convolution features output by each of the feature convolution channels are combined to obtain the expanded image features.
The method according to claim 1, wherein the determining the key point position information corresponding to the target object in the target image based on the compressed image features comprises:

Enlarging the compressed image features to obtain the enlarged image features;

Convolving the enlarged image features to obtain a third extracted feature;

Based on the third extracted feature, determine key point position information corresponding to the target object in the target image.
The method according to claim 1, wherein said obtaining the target image to be subjected to pose estimation comprises:

Get the initial image;

Object detection is performed on the initial image to obtain probabilities that a plurality of candidate image regions in the initial image respectively include a target object;

selecting an object image area including the target object from the candidate image areas based on the probability that the candidate image area includes the target object;

An intercepted image area is extracted from the initial image according to the object image area, and the intercepted image is used as a target image to be subjected to pose estimation.
The method according to claim 4, wherein the extraction of the intercepted image area from the initial image according to the object image area, and using the intercepted image as the target image for pose estimation includes:

Acquiring the center coordinates in the image area of the object;

Acquiring an area size corresponding to the object image area, and obtaining an area extension value based on the area size and a size expansion coefficient;

Extending to an extension direction corresponding to the area extension value based on the center coordinates and the area extension value to obtain extension coordinates;

The image area located within the extended coordinates is used as the intercepted image area, and the intercepted image is used as the target image to be subjected to pose estimation.
The method according to claim 4, wherein the key point position information is multiple, and the method further comprises:

According to the mapping relationship between the key point position information and the target point position information, each of the key point position information is converted into corresponding target point position information; the target point position information is the key point position information in the initial location information in the image;

Estimating the pose of the target object based on each piece of target position information to obtain a target pose corresponding to the target image.
A method for generating a target video, characterized in that the method comprises:

Acquiring a target action, determining a gesture sequence corresponding to the target action, performing the gestures in the gesture sequence in order to obtain the target action;

Process the target image to obtain the target pose corresponding to each target image in the target image set;

Acquiring images corresponding to each target pose in the pose sequence from the set of target images as video frame images;

The obtained video frame images are arranged according to the sequence of gestures in the gesture sequence to obtain the target video corresponding to the target action.
A method for generating a target video, wherein the processing of the target image and obtaining the corresponding target posture of each target image in the target image set includes:

Acquiring a target image to be subjected to pose estimation; the target image includes a target object to be processed;

performing feature extraction based on the target image to obtain a first extracted feature;

performing feature expansion on the first extracted feature through an image feature expansion network to obtain expanded image features;

performing feature extraction on the expanded image feature to obtain a second extracted feature;

performing feature compression on the second extracted features through an image feature compression network to obtain compressed image features;

Determining key point position information corresponding to the target object in the target image based on the compressed image features, and performing pose estimation on the target object based on the key point position information.
The method according to claim 8, wherein the image feature expansion network includes a plurality of feature convolution channels, and the image feature expansion network is used to perform feature expansion on the first extracted feature, and the expanded image features obtained include :

Inputting the first extracted features into a plurality of feature convolution channels corresponding to the image feature expansion network, each of the feature convolution channels uses a feature dimension preserving convolution kernel to convolve the first extracted features, Obtain the convolution features output by each of the feature convolution channels;

The convolution features output by each of the feature convolution channels are combined to obtain the expanded image features.
The method according to claim 8, wherein the determining the key point position information corresponding to the target object in the target image based on the compressed image features comprises:

Enlarging the compressed image features to obtain the enlarged image features;

Convolving the enlarged image features to obtain a third extracted feature;

Based on the third extracted feature, determine key point position information corresponding to the target object in the target image.
The method according to claim 8, wherein said obtaining the target image to be subjected to pose estimation comprises:

Get the initial image;

Object detection is performed on the initial image to obtain probabilities that a plurality of candidate image regions in the initial image respectively include a target object;

selecting an object image area including the target object from the candidate image areas based on the probability that the candidate image area includes the target object;

An intercepted image area is extracted from the initial image according to the object image area, and the intercepted image is used as a target image to be subjected to pose estimation.
The method according to claim 11, wherein said extracting an intercepted image area from said initial image according to said object image area, and using the intercepted image as a target image for pose estimation comprises:

Acquiring the center coordinates in the image area of the object;

Acquiring an area size corresponding to the object image area, and obtaining an area extension value based on the area size and a size expansion coefficient;

Extending to an extension direction corresponding to the area extension value based on the center coordinates and the area extension value to obtain extension coordinates;

The image area located within the extended coordinates is used as the intercepted image area, and the intercepted image is used as the target image to be subjected to pose estimation.
The method according to claim 11, wherein there are multiple key point position information, and the method further comprises:

According to the mapping relationship between the key point position information and the target point position information, each of the key point position information is converted into corresponding target point position information; the target point position information is the key point position information in the initial location information in the image;

Estimating the pose of the target object based on each piece of target position information to obtain a target pose corresponding to the target image.
A pose estimation device, characterized in that the device comprises:

A target image acquisition module, configured to acquire a target image to be subjected to pose estimation; the target image includes a target object to be processed;

The first feature extraction module is used to perform feature extraction based on the target image to obtain the first feature extraction;

The expanded image feature obtaining module is used to perform feature expansion on the first extracted feature through the image feature expansion network to obtain the expanded image feature;

The second extraction feature obtaining module is used to perform feature extraction on the expanded image feature to obtain the second extraction feature;

A compressed image feature obtaining module, configured to perform feature compression on the second extracted feature through an image feature compression network to obtain compressed image features;

A key point position information determining module, configured to determine key point position information corresponding to the target object in the target image based on the compressed image features, and perform pose estimation on the target object based on the key point position information.
A computer device, comprising a memory and a processor, wherein the memory stores a computer program, wherein when the processor executes the computer program, any one of claims 1 to 6 or claims 7 to 13 is implemented. steps of the method described above.
A computer-readable storage medium on which a computer program is stored, wherein when the computer program is executed by a processor, the steps of any one of claims 1 to 6 or the method described in claims 7 to 13 are implemented .