CN114463856B - Method, device, equipment and medium for training attitude estimation model and attitude estimation - Google Patents

Method, device, equipment and medium for training attitude estimation model and attitude estimation

Info

Publication number
CN114463856B
CN114463856B
Authority
CN
China
Prior art keywords
feature extraction
target
estimation model
image
model
Prior art date
Legal status
Active
Application number
CN202210381777.1A
Other languages
Chinese (zh)
Other versions
CN114463856A (en)
Inventor
简春兵
龚凡
黄瑞琪
Current Assignee
Kingsignal Technology Co Ltd
Original Assignee
Kingsignal Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Kingsignal Technology Co Ltd filed Critical Kingsignal Technology Co Ltd
Priority to CN202210381777.1A
Publication of CN114463856A
Application granted
Publication of CN114463856B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, device, equipment and medium for training a pose estimation model and for pose estimation, wherein the method comprises the following steps: acquiring an image sample set; iteratively training an initial pose estimation model through the image sample set to obtain a target pose estimation model; wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network comprises: a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module comprises: a first feature extraction unit and a second feature extraction unit connected in parallel, and a feature fusion unit; the first feature extraction unit and the second feature extraction unit are provided with different feature extraction scales, and can provide higher-precision multi-scale features at the same layer for pose estimation, so that the pose estimation precision of the pose estimation model based on the stacked hourglass network is improved.

Description

Training and attitude estimation method, device, equipment and medium of attitude estimation model
Technical Field
The invention relates to the technical field of computers, in particular to a method, a device, equipment and a medium for training and attitude estimation of an attitude estimation model.
Background
Human body pose estimation is a fundamental yet challenging task in the field of computer vision, and is crucial for describing human posture, human behavior and the like. With the development of deep learning technology, human body pose estimation algorithms based on multi-scale feature fusion, represented by the stacked hourglass network, have shown excellent performance and occupy an important position in the field of pose estimation.
However, the existing human body pose estimation model based on the stacked hourglass network suffers from large information loss, and there is still considerable room for improvement.
Disclosure of Invention
The invention provides a method, a device, equipment and a medium for training and attitude estimation of an attitude estimation model, wherein a spatial pyramid pooling module with a plurality of feature extraction scales is added in a feature extraction network, and multi-scale features with higher precision at the same layer are extracted for attitude estimation, so that the attitude estimation precision of the attitude estimation model based on a stacked hourglass network is improved.
According to an aspect of the present invention, there is provided a training method of a pose estimation model, including:
acquiring an image sample set;
iteratively training an initial attitude estimation model through the image sample set to obtain a target attitude estimation model;
wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales.
According to another aspect of the present invention, there is provided an attitude estimation method, including:
acquiring an image to be analyzed;
inputting the image to be analyzed into a target attitude estimation model obtained by training by adopting an attitude estimation model training method;
acquiring a target positioning diagram and a target bias diagram of the image to be analyzed output by the target attitude estimation model;
and determining the estimated coordinates of each key point of the target body in the image to be analyzed based on the target positioning graph and the target offset graph, and estimating the posture of the target body according to the estimated coordinates of each key point.
According to another aspect of the present invention, there is provided a training apparatus for a pose estimation model, including:
the acquisition module is used for acquiring an image sample set;
the training module is used for iteratively training an initial attitude estimation model through the image sample set to obtain a target attitude estimation model; wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales.
According to another aspect of the present invention, there is provided an attitude estimation device including:
the first acquisition module is used for acquiring an image to be analyzed;
the input module is used for inputting the image to be analyzed into a target attitude estimation model obtained by training through an attitude estimation model training method;
the second acquisition module is used for acquiring a target positioning map and a target offset map of the image to be analyzed, which are output by the target attitude estimation model;
and the attitude estimation module is used for determining the estimated coordinates of each key point of the target body in the image to be analyzed based on the target positioning graph and the target bias graph and estimating the attitude of the target body according to the estimated coordinates of each key point.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to enable the at least one processor to perform a method for training a pose estimation model and/or a method for pose estimation according to any embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement a training method and/or a pose estimation method of a pose estimation model according to any one of the embodiments of the present invention when the computer instructions are executed.
According to the technical scheme of the embodiment of the invention, an image sample set is obtained; iteratively training an initial attitude estimation model through an image sample set to obtain a target attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit are provided with different feature extraction scales, and can provide multi-scale features with higher precision at the same layer for attitude estimation, so that the attitude estimation precision of the attitude estimation model based on the stacked hourglass network is improved.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a flowchart of a training method of an attitude estimation model according to an embodiment of the present invention;
FIG. 2A is a schematic structural diagram of an attitude estimation model according to an embodiment of the present invention;
FIG. 2B is a schematic structural diagram of an attitude estimation model according to a second embodiment of the present invention;
fig. 2C is a schematic structural diagram of a spatial pyramid pooling module according to a second embodiment of the present invention;
FIG. 3A is a flowchart of a method for training a pose estimation model according to a third embodiment of the present invention;
FIG. 3B is a schematic diagram of training a model to be trained according to a third embodiment of the present invention;
FIG. 4 is a flowchart of an attitude estimation method according to a fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a training apparatus for a pose estimation model according to a fifth embodiment of the present invention;
fig. 6 is a schematic structural diagram of an attitude estimation device according to a sixth embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device implementing a method for training a pose estimation model according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of a method for training a pose estimation model according to an embodiment of the present invention. This embodiment is applicable to the case of training a pose estimation model, and the method may be executed by a training apparatus of the pose estimation model, where the training apparatus may be implemented in the form of hardware and/or software and may be configured in an electronic device.
As shown in fig. 1, the method specifically includes the following steps:
and S110, acquiring an image sample set.
Wherein, the image sample set is a set of sample images used for training the initial pose estimation model. The pose of the target body in each image of the image sample set may be determined by the location of each keypoint; therefore, the true coordinates of the keypoints of the target body are annotated in each image. The keypoints can be regarded as the joint points or other important supporting points of the human body. Typically, the input sample images all have a size of 256×256.
Illustratively, the MPII human pose estimation dataset may be employed to determine the set of image samples. The MPII human pose estimation data set may include a training set and a test set. The MPII human pose estimation dataset is a picture dataset that records human activities in the real world, and contains about 25000 pictures and 40000 subjects.
Optionally, in order to improve the accuracy and generalization capability of the trained model to the posture estimation of the input image to be analyzed, data enhancement may be performed on the sample image in the image sample set through a preset operation; the preset operation includes at least one of: random rotation, random flipping, and random resizing to enhance the randomness and diversity of the sample image.
Illustratively, the rotation range is (-45°, 45°), the random resizing range is (0.65, 1.35), and the probability of random flipping is 0.5.
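As a non-authoritative sketch, the augmentation above can be written with torchvision-style transforms; the function name `augment`, the PIL-image input and the omission of the final crop back to 256×256 are illustrative assumptions:

```python
import random
import torchvision.transforms.functional as TF

def augment(image, keypoints):
    """Apply the random rotation, resizing and flipping described above.

    `image` is a PIL image; the annotated keypoint coordinates must be
    transformed with the same parameters, which is omitted here, as is
    the final crop back to 256x256.
    """
    angle = random.uniform(-45.0, 45.0)         # rotation range (-45°, 45°)
    scale = random.uniform(0.65, 1.35)          # random resizing range
    image = TF.rotate(image, angle)
    w, h = image.size
    image = TF.resize(image, [int(h * scale), int(w * scale)])
    if random.random() < 0.5:                   # flip probability 0.5
        image = TF.hflip(image)
    return image, keypoints
```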
S120, iteratively training the initial attitude estimation model through the image sample set to obtain a target attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales.
Wherein the initial attitude estimation model comprises: a feature extraction network 10 and a stacked hourglass network 20; the feature extraction network 10 is used for performing feature extraction on sample images in the image sample set to obtain target feature images, and is a main part of the initial attitude estimation model; the stacked hourglass network 20 may be stacked from at least two hourglass networks for integrating information for all scales of the feature image.
Specifically, in the existing pose estimation model, the feature extraction network generally extracts feature data in a sample image through two down-sampling layers and three feature extraction layers, lacks multi-scale features from the same level, and has a large information loss. Therefore, in the embodiment of the present invention, as shown in fig. 2A, a spatial pyramid pooling module 12 is added after the last feature extraction layer of the original feature extraction network to form the feature extraction network 10 in the embodiment of the present invention, that is, the feature extraction network 10 includes: a feature extraction module 11 and a spatial pyramid pooling module 12. The spatial pyramid pooling module 12 includes: a first feature extraction unit 121 and a second feature extraction unit 122 connected in parallel, and a feature fusion unit 123. The first feature extraction unit 121 and the second feature extraction unit 122 may set feature extraction operations of different scales, and thus, multi-scale features of the same layer in the sample image may be extracted by the first feature extraction unit 121 and the second feature extraction unit 122 connected in parallel in the spatial pyramid pooling module 12 to obtain scale features with higher accuracy. The feature fusion unit 123 is configured to perform feature fusion on features of different scales output by the first feature extraction unit and the second feature extraction unit to obtain a target feature image, input the target feature image into the stacked hourglass network to perform integration of information of all scales, so as to obtain a joint point heat map and an offset map of a target body in the sample image, and perform iterative adjustment on network parameters in the initial posture estimation model according to the positioning map, the offset map and the sample image output by each training to obtain the target posture estimation model.
It should be noted that, the target attitude estimation model may be obtained by iteratively training the initial attitude estimation model to adjust the network parameters in one training stage, or the training process may be divided into a plurality of training stages, and the target attitude estimation model may be obtained by iteratively training the initial attitude estimation model to adjust the network parameters step by step in each training stage.
According to the technical scheme of the embodiment of the invention, an image sample set is obtained; iteratively training an initial attitude estimation model through an image sample set to obtain a target attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit are provided with different feature extraction scales, and can provide multi-scale features with higher precision at the same layer for attitude estimation, so that the attitude estimation precision of the attitude estimation model based on the stacked hourglass network is improved.
Example two
Fig. 2B is a schematic structural diagram of an initial pose estimation model according to a second embodiment of the present invention, and this embodiment further details the structure of the spatial pyramid pooling module according to the first embodiment. As shown in fig. 2B, the initial pose estimation model includes: a feature extraction network 10 and a stacked hourglass network 20; the feature extraction network 10 includes: a feature extraction module 11 and a spatial pyramid pooling module 12; the spatial pyramid pooling module 12 includes: a first feature extraction unit 121 and a second feature extraction unit 122, which are connected in parallel, and a feature fusion unit 123; the first feature extraction unit 121 and the second feature extraction unit 122 set different feature extraction scales; the first feature extraction unit 121 includes: at least one first build-up layer; the second feature extraction unit 122 includes: a first predetermined number of second convolutional layers 1221 and pyramid pooling layers 1222 in series.
Specifically, the first feature unit 121 includes at least one first convolution layer 1211, configured to store and refine feature information of the feature image output by the feature extraction module to obtain more refined feature information; the second feature cell 122 includes: a first preset number of second convolution layers 1221 and a pyramid pooling layer 1222 connected in series, the pyramid pooling layer 1222 including a plurality of scales of pooling operations for further extracting multi-scale features of the feature image output by the feature extraction module. The multi-scale features extracted by the first feature unit 121 and the second feature unit 122 are fused by the feature fusion unit 123, so that more refined multi-scale information is obtained.
For example, the first convolution layer 1211 and the second convolution layers 1221 may be convolutional layers with a convolution kernel of 1×1.
Optionally, the first preset number is 3; the convolution kernels of the 3 second convolution layers are all 1×1; among the 3 second convolution layers, the 1st convolution layer is connected with the 2nd convolution layer, and the 2nd convolution layer is connected with the pyramid pooling layer; the pyramid pooling layer is connected with the 3rd convolution layer;
the pyramid pooling layer includes: at least two maximal pooling layers connected in parallel, each maximal pooling layer having a different convolution kernel scale.
Specifically, fig. 2C is a schematic structural diagram of the spatial pyramid pooling module 12. As shown in fig. 2C, the spatial pyramid pooling module 12 includes: a first feature extraction unit 121 and a second feature extraction unit 122 connected in parallel, and a feature fusion unit 123. The first feature extraction unit 121 includes a convolutional layer with a 1×1 convolution kernel; the second feature extraction unit 122 includes: 3 convolutional layers 1221 with 1×1 convolution kernels (1221.1, 1221.2 and 1221.3, respectively) and a pyramid pooling layer 1222. The 1st convolutional layer 1221.1 is connected with the 2nd convolutional layer 1221.2, the 2nd convolutional layer 1221.2 is connected with the pyramid pooling layer 1222, and the pyramid pooling layer 1222 is connected with the 3rd convolutional layer 1221.3. The pyramid pooling layer 1222 includes: at least two maximal pooling layers connected in parallel, each maximal pooling layer having a different convolution kernel scale.
For example, the pyramid pooling layer 1222 may be formed by four parallel maximal pooling layers with convolution kernels of 3×3, 5×5, 9×9 and 13×13, respectively, constituting a pooling layer with a pyramid structure.
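A minimal PyTorch sketch of the module in fig. 2C follows. Only the 1×1 convolutions and the 3/5/9/13 pooling kernels come from the description; the channel width, the concatenation used for the pyramid outputs and the fusion unit, and the stride-1 padded pooling are assumptions:

```python
import torch
import torch.nn as nn

class SpatialPyramidPooling(nn.Module):
    """Sketch of the module in fig. 2C; channel width is an assumption."""

    def __init__(self, channels=256):
        super().__init__()
        # First feature extraction unit 121: a single 1x1 convolution.
        self.branch1 = nn.Conv2d(channels, channels, kernel_size=1)
        # Second unit 122: 1x1 conv -> 1x1 conv -> pyramid pooling -> 1x1 conv.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=1)
        # Pyramid pooling layer 1222: parallel max pools with kernels
        # 3, 5, 9 and 13; stride 1 plus padding keeps the spatial size.
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
            for k in (3, 5, 9, 13)
        )
        # Concatenating the four pooled maps gives 4x the channels,
        # reduced back by the 3rd 1x1 convolution (1221.3).
        self.conv3 = nn.Conv2d(4 * channels, channels, kernel_size=1)
        # Feature fusion unit 123, realized here as a 1x1 convolution over
        # the concatenated branches (the fusion rule is not fixed by the text).
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        a = self.branch1(x)
        b = self.conv2(self.conv1(x))
        b = torch.cat([pool(b) for pool in self.pools], dim=1)
        b = self.conv3(b)
        return self.fuse(torch.cat([a, b], dim=1))
```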
Optionally, the feature extraction module 11 includes: the device comprises a first preset number of down-sampling layers and a second preset number of feature extraction layers, wherein the down-sampling layers are convolutional layers.
TABLE 1
(Table 1 appears as an image in the original publication; per the description below, it lists the layer hierarchy of the feature extraction module.)
Illustratively, the first preset number is 2, and the second preset number is 3, that is, as shown in table 1, the hierarchical structure of the feature extraction module 11 includes: the 1 st down-sampling layer, the 1 st feature extraction layer, the 2 nd down-sampling layer, the 2 nd feature extraction layer and the 3 rd feature extraction layer.
In a conventional feature extraction network, the 2nd down-sampling layer is generally a maximum pooling layer; the embodiment of the present invention replaces the 2nd down-sampling layer, changing it from a maximum pooling layer to a convolutional layer, thereby improving the feature extraction accuracy.
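A sketch of the stem following the layer order of Table 1, assuming common kernel sizes, strides and channel widths (the patent specifies only the layer order and that the 2nd down-sampling layer is a convolution rather than a max pooling):

```python
import torch.nn as nn

def make_feature_extraction_module(in_ch=3, mid_ch=128, out_ch=256):
    """Stem following the layer order of Table 1. Simple conv blocks stand
    in for the feature extraction layers, whose internals the patent does
    not specify; kernel sizes and channel widths are assumed."""
    def block(c_in, c_out):
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 7, stride=2, padding=3),   # 1st down-sampling layer
        block(mid_ch, mid_ch),                              # 1st feature extraction layer
        nn.Conv2d(mid_ch, mid_ch, 3, stride=2, padding=1),  # 2nd down-sampling layer (conv, not max pool)
        block(mid_ch, out_ch),                              # 2nd feature extraction layer
        block(out_ch, out_ch),                              # 3rd feature extraction layer
    )
```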
The technical scheme of the embodiment of the invention comprises the steps of obtaining an image sample set; iteratively training an initial attitude estimation model through an image sample set to obtain a target attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales; the first feature extraction unit includes: at least one first build-up layer; the second feature extraction unit includes: the second convolution layers and the pyramid pooling layers which are connected in series in the first preset number can provide multi-scale features with higher precision of the same layer for attitude estimation, so that the attitude estimation precision of the attitude estimation model based on the stacked hourglass network is improved.
EXAMPLE III
Fig. 3A is a flowchart of a training method of a pose estimation model according to a third embodiment of the present invention, and this embodiment further details step S120 in the first embodiment. As shown in fig. 3A, the method includes:
and S310, acquiring an image sample set.
S320, determining a first image sample set in the image sample set as a target image sample set, and determining an initial attitude estimation model as a model to be trained; iteratively training a model to be trained through a target image sample set to obtain an effective attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit are provided with different feature extraction scales; the stacked hourglass network includes: a third preset number of hourglass networks; the third preset number is less than the optimal set number.
The stacked hourglass network in the initial pose estimation model may comprise at least one hourglass network. According to prior knowledge, the pose estimation effect of the pose estimation model is best when the number of hourglass networks equals the optimal set number; however, because that number is large, one round of model training takes a long time and training efficiency is low. To solve this problem, in the embodiment of the present invention, the training of the pose estimation model is completed in two training stages. The stacked hourglass network of the initial pose estimation model in the first training stage comprises: a third preset number of hourglass networks, the third preset number being less than the optimal set number.
Specifically, the validity of the initial attitude estimation model is verified in the first training stage, and the specific implementation process is as follows: and determining a first image sample set in the image sample set as a target image sample set, determining the initial attitude estimation model as a model to be trained, and iteratively training the model to be trained through the target image sample set to obtain an effective attitude estimation model. The training time for determining the effectiveness of the model is greatly shortened due to the fact that the number of the hourglass networks in the initial attitude estimation model is small.
Optionally, the third preset number is 2, and the optimal number is 8.
S330, increasing the number of hourglass networks in the stacked hourglass network of the effective pose estimation model to the optimal set number.
Specifically, on the basis of the verified effectiveness of the effective pose estimation model obtained by iteratively training the initial pose estimation model, the number of hourglass networks in its stacked hourglass network is increased to the optimal set number; iterative training is then continued on the effective pose estimation model containing the optimal set number of hourglass networks, so as to obtain a complete target pose estimation model with higher precision and stronger feature representation capability.
S340, determining a second image sample set in the image sample set as a target image sample set, and determining an effective posture estimation model containing the optimal set number of hourglass networks as a model to be trained.
Specifically, in the second training phase, a second image sample set in the image sample set is determined as a target image sample set, and an effective posture estimation model containing the optimum set number of hourglass networks is determined as a model to be trained.
And S350, returning to execute the operation of iteratively training the model to be trained through the target image sample set to obtain the target posture estimation model.
Specifically, the operation of iteratively training the model to be trained through the target image sample set is returned to be executed, iterative training is carried out on the effective attitude estimation model through the second image sample set, and finally the target attitude estimation model is obtained.
According to the technical scheme of the embodiment of the invention, an image sample set is obtained; determining a first image sample set in the image sample set as a target image sample set, and determining an initial attitude estimation model as a model to be trained; iteratively training a model to be trained through a target image sample set to obtain an effective attitude estimation model; wherein the initial attitude estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales; the stacked hourglass network includes: a third preset number of hourglass networks; the third preset number is smaller than the optimal set number; increasing the number of the hourglass networks of the stacked hourglass networks in the effective attitude estimation model containing the hourglass networks with the optimal set number to the optimal set number; determining a second image sample set in the image sample set as a target image sample set, and determining an effective attitude estimation model containing the hourglass network with the optimal set number as a model to be trained; returning to execute the operation of iteratively training the model to be trained through the target image sample set to obtain a target attitude estimation model, and providing and extracting multi-scale features with higher precision on the same layer for attitude estimation, so that the attitude estimation precision of the attitude estimation model based on the stacked hourglass network is improved; meanwhile, the training process of the initial attitude estimation model is divided into two different stages, and a small number of hourglass networks are set in the first stage of verifying the effectiveness of the model, so that the training time for verifying the effectiveness of the model is shortened.
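One plausible realization of the two-stage schedule is sketched below; `StubHourglass`, the `grow` helper and the reuse of stage-1 weights when the stack is expanded are illustrative assumptions, as the patent only states that the number of hourglass networks is increased between stages:

```python
import torch.nn as nn

class StubHourglass(nn.Module):
    """Stand-in for a real hourglass network (identity for illustration)."""
    def forward(self, x):
        return x

class StackedHourglass(nn.Module):
    def __init__(self, num_stacks):
        super().__init__()
        self.stacks = nn.ModuleList(StubHourglass() for _ in range(num_stacks))

    def forward(self, x):
        for hourglass in self.stacks:
            x = hourglass(x)
        return x

    def grow(self, to):
        """Append stages up to the optimal set number, keeping the
        already-trained ones (weight reuse is not stated in the patent)."""
        while len(self.stacks) < to:
            self.stacks.append(StubHourglass())

model = StackedHourglass(num_stacks=2)  # stage 1: verify validity quickly
# ... iteratively train on the first image sample set ...
model.grow(to=8)                        # stage 2: expand to the optimal number
# ... continue iterative training on the second image sample set ...
```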
Optionally, the operation of iteratively training the model to be trained through the target image sample set includes:
inputting sample images in a target image sample set into a feature extraction module of a model to be trained for feature extraction to obtain a first feature image;
inputting the first characteristic image into a first characteristic unit and a second characteristic unit in a spatial pyramid pooling module of the model to be trained respectively, and inputting the second characteristic image output by the first characteristic unit and a third characteristic image output by the second characteristic unit into a characteristic fusion unit in the spatial pyramid pooling module of the model to be trained to obtain a target characteristic image;
inputting the target characteristic image into a stacked hourglass network of a model to be trained, obtaining a positioning diagram and an offset diagram of a sample image, and determining the estimated coordinates of each key point of a target body in the sample image according to the positioning diagram and the offset diagram;
calculating a loss function value according to the estimated coordinates of each key point and the real coordinates of each key point, and adjusting network parameters in the model to be trained on the basis of the loss function value;
and returning to execute the operation of inputting the sample images in the target image sample set into the feature extraction module of the model to be trained to extract features until a preset condition is reached.
Specifically, the process of iteratively training the initial posture estimation model through the image sample set to obtain the target posture estimation model can be completed in two training stages, wherein models to be trained in the two training stages are different from training samples, but the specific steps of training are basically the same. The model to be trained comprises: the device comprises a feature extraction module, a pyramid module and a stacked hourglass network. Therefore, as shown in fig. 3B, the training process specifically includes: inputting sample images in a target image sample set into a feature extraction module 11 in a feature extraction network 10 of a model to be trained for feature extraction to obtain a first feature image; inputting the first feature image into a spatial pyramid pooling module 12 in a feature extraction network 10 of the model to be trained, performing multi-scale feature extraction on the first feature image at the same level through a first feature extraction unit 121 and a second feature extraction unit 122 of the spatial pyramid pooling module 12, and performing feature fusion through a feature fusion module 123 to obtain a target feature image; and sequentially inputting the target characteristic images into a preset number of hourglass networks in the stacked hourglass networks 20 of the model to be trained to obtain a positioning diagram and an offset diagram of the sample image.
The positioning diagram is used for representing positioning areas corresponding to key points of a target body in a sample image, and in the prior art, a Gaussian distribution heat map is usually used for representing the coordinate positions of the key points of the target body in the sample image.
Specifically, the specific steps of calculating the loss function value according to the positioning graph, the bias graph and the sample image, and adjusting the network parameters in the model to be trained based on the loss function value may be: and carrying out unbiased estimation on the coordinates of each key point in the positioning map according to the bias map, calculating a loss function value according to the estimated coordinates obtained by unbiased estimation and the real coordinates of the key points in the sample image, and adjusting the network parameters in the model to be trained according to the loss function value. And returning to execute the operation of inputting the sample images in the target image sample set into the feature extraction module of the model to be trained to extract features until a preset condition is reached. The preset condition may be that the loss function value is the minimum, or the training times reach a set threshold.
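The iteration described above can be sketched as follows. Note that the patent computes the loss from estimated versus true coordinates; since the argmax in the decoding step is not differentiable, this sketch uses the usual surrogate of supervising the predicted maps against ground-truth maps, which is an assumption:

```python
def training_step(model, optimizer, images, target_maps, loss_fn):
    """One iteration of the operation above. `model` is the feature
    extraction network followed by the stacked hourglass network and
    returns a localization map and an offset map; `target_maps` are
    ground-truth maps built from the annotated keypoint coordinates.
    """
    optimizer.zero_grad()
    loc_map, offset_map = model(images)        # stem -> SPP -> hourglasses
    loss = loss_fn(loc_map, offset_map, target_maps)
    loss.backward()                            # adjust network parameters
    optimizer.step()
    return loss.item()
```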
Illustratively, the experiments on the MPII data set adopt PCKh as the evaluation metric: an estimated keypoint whose error is less than $\alpha$ times the head length is regarded as a positive example, with $\alpha$ commonly taking the values 0.5 and 0.1. PCKh is expressed as follows:

$$\mathrm{PCKh}@\alpha = \frac{1}{n}\sum_{p \in P}\delta\left(\lVert p - p^{*}\rVert < \alpha \cdot l_{\mathrm{head}}\right)$$

wherein $P$ represents all the keypoint coordinates estimated from the positioning map, $p$ represents one of the keypoint coordinates, $p^{*}$ represents the true coordinate of the corresponding keypoint, $\alpha$ represents the length scale factor, $l_{\mathrm{head}}$ represents the length of the corresponding person's head, $\delta(\cdot)$ represents a unary indicator function that returns 1 if its argument is true and 0 otherwise, and $n$ represents the total number of keypoints in the picture.
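A direct NumPy rendering of the PCKh formula above (the array shapes are assumptions):

```python
import numpy as np

def pckh(pred, gt, head_len, alpha=0.5):
    """PCKh@alpha per the formula above.

    pred, gt: (n, 2) arrays of estimated and true keypoint coordinates;
    head_len: (n,) array with the head length of the matching person.
    """
    err = np.linalg.norm(pred - gt, axis=1)
    return np.mean(err < alpha * head_len)
```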
Optionally, the bias map includes: a first-direction offset map and a second-direction offset map; determining the estimated coordinates of each key point of the target body in the sample image according to the positioning map and the bias map comprises the following steps:
determining the maximum response coordinate corresponding to the positioning area of each key point in the positioning map; the positioning area of the key points is a preset range corresponding to the real coordinates of each key point in the sample image;
for each key point, determining a first bias response coordinate according to the maximum response coordinate and a first bias coordinate of the key point in the first direction bias graph; determining a second bias response coordinate according to the maximum response coordinate and a second bias coordinate of the key point in the second direction bias graph;
and for each key point, carrying out unbiased estimation on the maximum response coordinate according to the first biased response coordinate and the second biased response coordinate to obtain an estimated coordinate.
Wherein, the bias map includes: a first-direction offset map and a second-direction offset map; the first-direction offset map may be an offset map in the X direction, and the second-direction offset map may be an offset map in the Y direction.
Specifically, the positioning area of each keypoint in the positioning map may be represented as:

$$P(x_0, y_0, i)=\begin{cases}1, & (x_0-m_i)^2+(y_0-n_i)^2\le R^2\\ 0, & \text{otherwise}\end{cases}$$

and the first bias coordinate and the second bias coordinate of each keypoint in the first-direction and second-direction offset maps may be respectively expressed as:

$$X(x_0, y_0, i)=m_i-x_0,\qquad Y(x_0, y_0, i)=n_i-y_0$$

wherein $(x_0, y_0, i)$ denotes a coordinate in the positioning area of the $i$-th keypoint, $P(x_0, y_0, i)$ denotes the value of the positioning map at that coordinate, $R$ denotes the radius of the positioning area, and $(m_i, n_i)$ denotes the true coordinate of the $i$-th keypoint. The maximum response coordinate corresponding to the positioning area of each keypoint in the positioning map may be determined as:

$$(\tilde{x}_i, \tilde{y}_i)=\arg\max_{(x_0, y_0)}\,(K * P)(x_0, y_0, i)$$

wherein $K$ denotes a Gaussian kernel used for smoothing the positioning map, and $(\tilde{x}_i, \tilde{y}_i)$ denotes the maximum response coordinate corresponding to the positioning area of the $i$-th keypoint.

For each keypoint, the estimated coordinate may be determined from the first bias response coordinate and the second bias response coordinate as:

$$\hat{x}_i=\tilde{x}_i+X(\tilde{x}_i, \tilde{y}_i, i),\qquad \hat{y}_i=\tilde{y}_i+Y(\tilde{x}_i, \tilde{y}_i, i)$$

wherein $\tilde{x}_i+X(\tilde{x}_i, \tilde{y}_i, i)$ is the first bias response coordinate of the $i$-th keypoint from the first-direction offset map, $\tilde{y}_i+Y(\tilde{x}_i, \tilde{y}_i, i)$ is the second bias response coordinate of the $i$-th keypoint from the second-direction offset map, and $(\hat{x}_i, \hat{y}_i)$ is the estimated coordinate of the $i$-th keypoint. Illustratively, $R$ is 1, the Gaussian kernel size adopted for the positioning map is $15\times15$, and the Gaussian kernel size adopted for the bias maps is $7\times7$.
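Putting the decoding steps together, a NumPy sketch (the (K, H, W)/(2, K, H, W) array layout is an assumption, and the Gaussian smoothing is omitted):

```python
import numpy as np

def decode(loc_map, offsets):
    """Unbiased estimation per the steps above. `loc_map` is (K, H, W),
    one channel per keypoint; `offsets` is (2, K, H, W) holding the
    first-direction (X) and second-direction (Y) offset maps. Gaussian
    smoothing (15x15 for the positioning map and 7x7 for the bias maps
    in the example) is omitted for brevity."""
    coords = []
    for i in range(loc_map.shape[0]):
        y0, x0 = np.unravel_index(np.argmax(loc_map[i]), loc_map[i].shape)
        x = x0 + offsets[0, i, y0, x0]   # first bias response coordinate
        y = y0 + offsets[1, i, y0, x0]   # second bias response coordinate
        coords.append((x, y))
    return np.asarray(coords)
```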
In one particular embodiment, only the MPII training set is employed as training samples for the results on the MPII validation set. Three comparative examples were used to verify the effectiveness of the initial pose estimation model provided by the embodiment of the invention. The specific experimental results are compared as follows:
comparative example 1:
a prior art stacked hourglass network containing 2 hourglass was used as a baseline model for ablation experiments.
Comparative example 2:
on the basis of sampling an attitude estimation network comprising 2 hourglass networks in the prior art, the strategy for carrying out unbiased estimation on the maximum response coordinate according to the first offset response coordinate and the second offset response coordinate to obtain the estimated coordinate provided by the embodiment of the invention is added.
Comparative example 3:
on the basis of comparative example 2, the initial attitude estimation model in the embodiment of the invention is adopted, and comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales.
For these comparative examples, the results obtained on the MPII validation set, using the MPII training set as training samples, are shown in Table 2, which verifies the effectiveness of the effective pose estimation model obtained by the training method of the initial pose estimation model according to the embodiment of the present invention. Illustratively, during the training process, the initial learning rate is 1e-4; when the number of training iterations reaches 170 and 200, the learning rate is reduced to 0.1 and 0.01 of the initial learning rate, respectively; the Adam gradient optimizer is used, the batch size is set to 16, and the total number of training epochs is set to 210.
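The quoted schedule maps directly onto a standard PyTorch optimizer and multi-step scheduler; the stand-in model and the omitted data loop are placeholders:

```python
import torch

model = torch.nn.Linear(1, 1)  # stand-in for the pose estimation model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Decay to 0.1x and 0.01x of the initial rate at iterations 170 and 200.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[170, 200], gamma=0.1
)
for epoch in range(210):                 # total number of training epochs
    # ... one pass over the training set with batch size 16 ...
    scheduler.step()
```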
TABLE 2
Figure 189207DEST_PATH_IMAGE017
After verifying the validity of the effective pose estimation model, the number of hourglass networks in the effective pose estimation model is increased to 8. The results on the MPII test set, obtained by training the models of comparative example 4 and of this embodiment using the MPII training set and validation set as training samples, are shown in Table 3:
comparative example 4:
the attitude estimation model of the stacked hourglass network in the prior art comprises 8 hourglass;
the embodiment of the invention comprises the following steps:
an efficient pose estimation model containing 8 hourglass networks.
TABLE 3
Figure 396197DEST_PATH_IMAGE018
From the comparison of the results in Table 2, it can be found that, with a stacked hourglass network containing 2 hourglass networks as the baseline model, introducing the unbiased estimation strategy provided by the embodiment of the present invention obtains a boost in PCKh@0.5 and a 2.7% boost in PCKh@0.1 on the MPII human pose estimation validation set, which illustrates the effectiveness of the unbiased estimation strategy for achieving high-precision pose estimation.
On this basis, the feature extraction network is modified: a down-sampling module is replaced and a spatial pyramid pooling module based on cross-stage partitioning is added, further improving PCKh@0.5 by 0.3% and PCKh@0.1 by 0.6%, which shows that the feature extraction network provided by the embodiment of the present invention has stronger feature representation capability.
As can be found from the comparison of the results in Table 3, the target pose estimation model provided by the embodiment of the present invention, as a complete pose estimation model based on a stacked hourglass network improved with spatial pyramid pooling, obtains a 1.2% boost in PCKh@0.5 and a 2.1% boost in PCKh@0.1 on the MPII test set compared with the prior-art pose estimation model based on a stacked hourglass network containing 8 hourglass networks, which further illustrates the effectiveness of the target pose estimation model obtained by the training method of the pose estimation model provided by the embodiment of the present invention.
Example four
Fig. 4 is a flowchart of a pose estimation method according to a fourth embodiment of the present invention, where this embodiment is applicable to a case where a target pose estimation model obtained by training based on the foregoing embodiments performs pose estimation on a target object in an image to be analyzed, and the method may be executed by a pose estimation apparatus, where the pose estimation apparatus may be implemented in a form of hardware and/or software, and the pose estimation apparatus may be configured in an electronic device.
As shown in fig. 4, the method specifically includes the following steps:
and S410, acquiring an image to be analyzed.
The image to be analyzed is an image containing a target body whose pose is to be estimated. Generally, the size of the image to be analyzed is the same as that of the sample images, namely 256×256.
S420, inputting the image to be analyzed into the target pose estimation model obtained by training with the training method of the pose estimation model described above.
Specifically, the target attitude estimation model obtained by training with the attitude estimation model training method includes: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales. The multi-scale features of the same layer in the image to be analyzed can be extracted through the first feature extraction unit and the second feature extraction unit which are connected in parallel in the spatial pyramid pooling module, so that the scale features with higher precision can be obtained.
And S430, acquiring a target positioning diagram and a target offset diagram of the image to be analyzed output by the target attitude estimation model.
The target positioning map is used to reflect the coordinates of each key point of the target body in the image to be analyzed. The target bias map includes: a first-direction target bias map and a second-direction target bias map; the first-direction target bias map may be, for example, an X-direction target bias map, and the second-direction target bias map may be a Y-direction target bias map. The target bias map is used to perform unbiased estimation on the coordinates of the target body in the target positioning map.
S440, determining the estimated coordinates of each key point of the target body in the image to be analyzed based on the target positioning graph and the target offset graph, and estimating the posture of the target body according to the estimated coordinates of each key point.
Specifically, unbiased estimation is performed according to the target offset coordinates of each key point in the target offset map and the coordinates of each key point in the target positioning map to obtain the estimated coordinates of each key point, so that the posture of the target body in the image to be analyzed is estimated according to the estimated coordinates of each key point.
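A sketch of the inference path, reusing a decoding helper such as the `decode` function sketched in the training section; the model's output convention is an assumption:

```python
import torch

@torch.no_grad()
def estimate_pose(model, image, decode):
    """Steps S410-S440 as a sketch: run the 256x256 image to be analyzed
    through the trained target pose estimation model, take the target
    positioning map and target bias maps it outputs, and decode them into
    estimated keypoint coordinates."""
    model.eval()
    loc_map, offsets = model(image.unsqueeze(0))    # add a batch dimension
    return decode(loc_map[0].cpu().numpy(), offsets[0].cpu().numpy())
```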
According to the technical scheme of the embodiment of the invention, the image to be analyzed is obtained; inputting an image to be analyzed into a target attitude estimation model obtained by training through the attitude estimation model training method of any one of the embodiments; acquiring a target positioning diagram and a target bias diagram of an image to be analyzed output by a target attitude estimation model; the estimated coordinates of all key points of the target body in the image to be analyzed are determined based on the target positioning graph and the target bias graph, the posture of the target body is estimated according to the estimated coordinates of all the key points, and the high-precision multi-scale features of the image to be analyzed can be obtained, so that the posture estimation precision of the image to be analyzed is improved.
EXAMPLE five
Fig. 5 is a schematic structural diagram of a training apparatus for a pose estimation model according to a fifth embodiment of the present invention. As shown in fig. 5, the apparatus includes: an acquisition module 510 and a training module 520;
the obtaining module 510 is configured to obtain an image sample set;
a training module 520, configured to iteratively train an initial pose estimation model through the image sample set to obtain a target pose estimation model; wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network includes: the system comprises a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module includes: the device comprises a first feature extraction unit, a second feature extraction unit and a feature fusion unit which are connected in parallel; the first feature extraction unit and the second feature extraction unit set different feature extraction scales.
Optionally, the first feature extraction unit includes: at least one first build-up layer; the second feature extraction unit includes: a first predetermined number of second convolution layers and pyramid pooling layers connected in series.
Optionally, the first preset number is 3; the convolution kernels of the 3 second convolution layers are all 1×1; the 1st convolution layer and the 2nd convolution layer of the 3 second convolution layers are connected, and the 2nd convolution layer is connected with the pyramid pooling layer; the pyramid pooling layer is connected with a 3rd convolution layer;
the pyramid pooling layer includes: at least two maximal pooling layers connected in parallel, the convolution kernel scales of each maximal pooling layer being different.
Optionally, the feature extraction module includes: a first preset number of down-sampling layers and a second preset number of feature extraction layers; the down-sampling layers are convolutional layers.
Optionally, the stacked hourglass network in the initial pose estimation model includes: a third preset number of hourglass networks; the third preset number is less than the optimal set number; the training module comprises:
a first determining unit, configured to determine a first image sample set in the image sample sets as a target image sample set, and determine the initial pose estimation model as a model to be trained;
the iterative training unit is used for iteratively training the model to be trained through the target image sample set to obtain an effective attitude estimation model;
a number setting unit, configured to increase the number of hourglass networks stacked in the effective attitude estimation model to an optimal set number;
the second determining unit is used for determining a second image sample set in the image sample sets as a target image sample set and determining an effective posture estimation model containing the optimal set number of hourglass networks as a model to be trained;
and the execution unit is used for returning to execute the operation of iteratively training the model to be trained through the target image sample set to obtain a target posture estimation model.
Optionally, the iterative training unit includes:
the first input subunit is used for inputting the sample images in the target image sample set into the feature extraction module of the model to be trained for feature extraction to obtain a first feature image;
the second input subunit is configured to input the first feature image into a first feature unit and a second feature unit in the spatial pyramid pooling module of the model to be trained, and input a second feature image output by the first feature unit and a third feature image output by the second feature unit into a feature fusion unit in the spatial pyramid pooling module of the model to be trained to obtain a target feature image;
the estimation subunit is used for inputting the target feature image into the stacked hourglass network of the model to be trained to obtain a positioning map and an offset map of the sample image, and determining the estimated coordinates of each key point of the target body in the sample image according to the positioning map and the offset map;
the calculation subunit is used for calculating a loss function value according to the estimated coordinates of each key point and the real coordinates of each key point, and adjusting the network parameters in the model to be trained on the basis of the loss function value;
and the execution subunit is used for returning and executing the operation of inputting the sample images in the target image sample set into the feature extraction module of the model to be trained for feature extraction until a preset condition is reached.
Optionally, the offset map includes: a first-direction offset map and a second-direction offset map; the estimation subunit is specifically configured to:
determine, in the positioning map, the maximum response coordinate corresponding to the positioning area of each key point; the positioning area of a key point is a preset range corresponding to the real coordinates of that key point in the sample image;
for each key point, determine a first offset response coordinate according to the maximum response coordinate and the first offset coordinate of the key point in the first-direction offset map, and determine a second offset response coordinate according to the maximum response coordinate and the second offset coordinate of the key point in the second-direction offset map;
and, for each key point, perform unbiased estimation on the maximum response coordinate according to the first offset response coordinate and the second offset response coordinate to obtain the estimated coordinates. A sketch of this decoding step follows.
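In code, this decoding could look like the sketch below. Reading the directional offsets at the maximum-response location and adding them to that location is one plausible interpretation of the unbiased-estimation step; the function name and tensor layout are assumptions.

import torch

def decode_keypoints(posmap, off_x, off_y):
    # posmap, off_x, off_y: (K, H, W) tensors, one channel per key point.
    # Returns a (K, 2) tensor of estimated (x, y) coordinates.
    K, H, W = posmap.shape
    flat = posmap.view(K, -1).argmax(dim=1)   # maximum response per key point
    ys = torch.div(flat, W, rounding_mode="floor")
    xs = flat % W
    coords = torch.zeros(K, 2)
    for k in range(K):
        x, y = xs[k], ys[k]
        # First and second offset response coordinates: the maximum
        # response corrected by the offsets sampled at that location.
        coords[k, 0] = x + off_x[k, y, x]
        coords[k, 1] = y + off_y[k, y, x]
    return coords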
Optionally, the apparatus further includes:
a data enhancement module, configured to perform data enhancement on the sample images in the image sample set through a preset operation;
the preset operation includes at least one of the following: random rotation, random flipping, and random resizing. An illustrative augmentation pipeline follows.
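As an illustration, a torchvision pipeline covering the three preset operations could look as follows. The API calls are real; the parameter values are arbitrary, and in key point training the same geometric transforms must also be applied to the key point labels.

from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomRotation(degrees=30),                     # random rotation
    transforms.RandomHorizontalFlip(p=0.5),                    # random flipping
    transforms.RandomResizedCrop(size=256, scale=(0.7, 1.0)),  # random resizing
])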
The apparatus for training a pose estimation model provided by this embodiment of the invention can execute the method for training a pose estimation model provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
EXAMPLE six
Fig. 6 is a schematic structural diagram of a pose estimation apparatus according to a sixth embodiment of the present invention. As shown in fig. 6, the apparatus includes: a first acquisition module 610, an input module 620, a second acquisition module 630 and a pose estimation module 640;
a first acquisition module 610, configured to acquire an image to be analyzed;
an input module 620, configured to input the image to be analyzed into a target pose estimation model trained using the method for training a pose estimation model according to any of the above embodiments;
a second acquisition module 630, configured to acquire a target positioning map and a target offset map of the image to be analyzed output by the target pose estimation model;
and a pose estimation module 640, configured to determine estimated coordinates of each key point of the target body in the image to be analyzed based on the target positioning map and the target offset map, and to estimate the pose of the target body according to the estimated coordinates of each key point. A brief usage sketch follows.
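A minimal end-to-end usage sketch, assuming a trained model that returns the positioning map and the two directional offset maps, and reusing the hypothetical decode_keypoints helper from above:

import torch

model.eval()
with torch.no_grad():
    posmap, off_x, off_y = model(image.unsqueeze(0))   # add a batch dimension
# Drop the batch dimension and decode one (x, y) estimate per key point;
# the resulting (K, 2) tensor describes the target body's pose.
keypoints = decode_keypoints(posmap[0], off_x[0], off_y[0])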
The pose estimation apparatus provided by this embodiment of the invention can execute the pose estimation method provided by any embodiment of the invention, and has the functional modules and beneficial effects corresponding to the executed method.
EXAMPLE seven
FIG. 7 illustrates a schematic diagram of an electronic device 70 that may be used to implement an embodiment of the present invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not intended to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 7, the electronic device 70 includes at least one processor 71 and a memory communicatively connected to the at least one processor 71, such as a read-only memory (ROM) 72 and a random access memory (RAM) 73. The memory stores a computer program executable by the at least one processor; the processor 71 can perform various appropriate actions and processes according to the computer program stored in the ROM 72 or loaded from the storage unit 78 into the RAM 73. The RAM 73 can also store various programs and data necessary for the operation of the electronic device 70. The processor 71, the ROM 72 and the RAM 73 are connected to one another by a bus 74. An input/output (I/O) interface 75 is also connected to the bus 74.
A plurality of components in the electronic device 70 are connected to the I/O interface 75, including: an input unit 76 such as a keyboard, a mouse, etc.; an output unit 77 such as various types of displays, speakers, and the like; a storage unit 78, such as a magnetic disk, optical disk, or the like; and a communication unit 79 such as a network card, modem, wireless communication transceiver, etc. The communication unit 79 allows the electronic device 70 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
Processor 71 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 71 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. Processor 71 performs the various methods and processes described above, such as training of a pose estimation model or a pose estimation method.
In some embodiments, the training of the pose estimation model or the pose estimation method may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 78. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 70 via the ROM 72 and/or the communication unit 79. When the computer program is loaded into RAM 73 and executed by processor 71, one or more steps of the above-described method of training a pose estimation model and/or pose estimation may be performed. Alternatively, in other embodiments, processor 71 may be configured by any other suitable means (e.g., by means of firmware) to perform the training of the pose estimation model and/or the pose estimation method.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service scalability found in traditional physical hosts and VPS services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for training a pose estimation model, comprising:
acquiring an image sample set;
iteratively training an initial pose estimation model through the image sample set to obtain a target pose estimation model;
wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network comprises: a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module comprises: a first feature extraction unit and a second feature extraction unit connected in parallel, and a feature fusion unit; the first feature extraction unit and the second feature extraction unit have different feature extraction scales;
the stacked hourglass network in the initial pose estimation model comprises: a third preset number of hourglass networks; the third preset number is less than an optimal set number;
wherein the iteratively training the initial pose estimation model through the image sample set to obtain a target pose estimation model comprises:
determining a first image sample set of the image sample set as a target image sample set, and determining the initial pose estimation model as a model to be trained;
iteratively training the model to be trained through the target image sample set to obtain an effective pose estimation model;
increasing the number of hourglass networks of the stacked hourglass network in the effective pose estimation model to the optimal set number;
determining a second image sample set of the image sample set as the target image sample set, and determining the effective pose estimation model containing the optimal set number of hourglass networks as the model to be trained;
and returning to the operation of iteratively training the model to be trained through the target image sample set, to obtain the target pose estimation model.
2. The method according to claim 1, wherein the first feature extraction unit comprises: at least one first convolution layer; and the second feature extraction unit comprises: a first preset number of second convolution layers and a pyramid pooling layer connected in series.
3. The method according to claim 2, wherein the first preset number is 3; the convolution kernels of the 3 second convolution layers are all 1×1; the 1st convolution layer is connected with the 2nd convolution layer, the 2nd convolution layer is connected with the pyramid pooling layer, and the pyramid pooling layer is connected with the 3rd convolution layer;
the pyramid pooling layer comprises: at least two max pooling layers connected in parallel, the pooling kernel scales of the max pooling layers being different from one another.
4. The method of claim 1, wherein the feature extraction module comprises: a first preset number of downsampling layers and a second preset number of feature extraction layers; the downsampling layers are convolution layers.
5. The method of claim 1, wherein the operation of iteratively training the model to be trained through the target image sample set comprises:
inputting sample images in the target image sample set into the feature extraction module of the model to be trained for feature extraction, to obtain a first feature image;
inputting the first feature image into the first feature extraction unit and the second feature extraction unit in the spatial pyramid pooling module of the model to be trained respectively, and inputting a second feature image output by the first feature extraction unit and a third feature image output by the second feature extraction unit into the feature fusion unit in the spatial pyramid pooling module, to obtain a target feature image;
inputting the target feature image into the stacked hourglass network of the model to be trained to obtain a positioning map and an offset map of the sample image, and determining estimated coordinates of each key point of a target body in the sample image according to the positioning map and the offset map;
calculating a loss function value according to the estimated coordinates and the real coordinates of each key point, and adjusting network parameters of the model to be trained based on the loss function value;
and returning to the operation of inputting the sample images in the target image sample set into the feature extraction module of the model to be trained for feature extraction, until a preset condition is reached.
6. The method of claim 5, wherein the offset map comprises: a first-direction offset map and a second-direction offset map; and the determining estimated coordinates of each key point of a target body in the sample image according to the positioning map and the offset map comprises:
determining, in the positioning map, a maximum response coordinate corresponding to a positioning area of each key point; the positioning area of a key point is a preset range corresponding to the real coordinates of that key point in the sample image;
for each key point, determining a first offset response coordinate according to the maximum response coordinate and a first offset coordinate of the key point in the first-direction offset map, and determining a second offset response coordinate according to the maximum response coordinate and a second offset coordinate of the key point in the second-direction offset map;
and, for each key point, performing unbiased estimation on the maximum response coordinate according to the first offset response coordinate and the second offset response coordinate to determine the estimated coordinates.
7. The method of claim 1, further comprising: performing data enhancement on the sample images in the image sample set through a preset operation;
the preset operation comprises at least one of the following operations: random rotation, random flipping, and random resizing.
8. An attitude estimation method, comprising:
acquiring an image to be analyzed;
inputting the image to be analyzed into a target pose estimation model trained by the method for training a pose estimation model according to any one of claims 1 to 7;
acquiring a target positioning map and a target offset map of the image to be analyzed output by the target pose estimation model;
and determining estimated coordinates of each key point of a target body in the image to be analyzed based on the target positioning map and the target offset map, and estimating the pose of the target body according to the estimated coordinates of each key point.
9. An apparatus for training a pose estimation model, comprising:
an acquisition module, configured to acquire an image sample set;
a training module, configured to iteratively train an initial pose estimation model through the image sample set to obtain a target pose estimation model; wherein the initial pose estimation model comprises: a feature extraction network and a stacked hourglass network; the feature extraction network comprises: a feature extraction module and a spatial pyramid pooling module; the spatial pyramid pooling module comprises: a first feature extraction unit and a second feature extraction unit connected in parallel, and a feature fusion unit; the first feature extraction unit and the second feature extraction unit have different feature extraction scales;
wherein the stacked hourglass network in the initial pose estimation model comprises: a third preset number of hourglass networks; the third preset number is less than an optimal set number;
wherein the training module comprises:
a first determining unit, configured to determine a first image sample set of the image sample set as a target image sample set, and to determine the initial pose estimation model as a model to be trained;
an iterative training unit, configured to iteratively train the model to be trained through the target image sample set to obtain an effective pose estimation model;
a number setting unit, configured to increase the number of hourglass networks of the stacked hourglass network in the effective pose estimation model to the optimal set number;
a second determining unit, configured to determine a second image sample set of the image sample set as the target image sample set, and to determine the effective pose estimation model containing the optimal set number of hourglass networks as the model to be trained;
and an execution unit, configured to return to the operation of iteratively training the model to be trained through the target image sample set, to obtain the target pose estimation model.
10. A pose estimation apparatus, comprising:
a first acquisition module, configured to acquire an image to be analyzed;
an input module, configured to input the image to be analyzed into a target pose estimation model trained by the method for training a pose estimation model according to any one of claims 1 to 7;
a second acquisition module, configured to acquire a target positioning map and a target offset map of the image to be analyzed output by the target pose estimation model;
and a pose estimation module, configured to determine estimated coordinates of each key point of a target body in the image to be analyzed based on the target positioning map and the target offset map, and to estimate the pose of the target body according to the estimated coordinates of each key point.
11. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of training a pose estimation model according to any one of claims 1-7 and/or the pose estimation method according to claim 8.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a processor to execute the method of training a pose estimation model according to any one of claims 1-7 and/or the pose estimation method according to claim 8.
CN202210381777.1A 2022-04-13 2022-04-13 Method, device, equipment and medium for training attitude estimation model and attitude estimation Active CN114463856B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210381777.1A CN114463856B (en) 2022-04-13 2022-04-13 Method, device, equipment and medium for training attitude estimation model and attitude estimation

Publications (2)

Publication Number Publication Date
CN114463856A (en) 2022-05-10
CN114463856B (en) 2022-07-19

Family

ID=81418637

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210381777.1A Active CN114463856B (en) 2022-04-13 2022-04-13 Method, device, equipment and medium for training attitude estimation model and attitude estimation

Country Status (1)

Country Link
CN (1) CN114463856B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114972425A (en) * 2022-05-18 2022-08-30 北京地平线机器人技术研发有限公司 Training method of motion state estimation model, motion state estimation method and device
CN117456562B (en) * 2023-12-25 2024-04-12 深圳须弥云图空间科技有限公司 Attitude estimation method and device

Citations (5)

Publication number Priority date Publication date Assignee Title
CN108960212A (en) * 2018-08-13 2018-12-07 电子科技大学 Based on the detection of human joint points end to end and classification method
CN110781765A (en) * 2019-09-30 2020-02-11 腾讯科技(深圳)有限公司 Human body posture recognition method, device, equipment and storage medium
CN111008583A (en) * 2019-11-28 2020-04-14 清华大学 Pedestrian and rider posture estimation method assisted by limb characteristics
US11074711B1 (en) * 2018-06-15 2021-07-27 Bertec Corporation System for estimating a pose of one or more persons in a scene
WO2022036777A1 (en) * 2020-08-21 2022-02-24 暨南大学 Method and device for intelligent estimation of human body movement posture based on convolutional neural network

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
CN109376571B (en) * 2018-08-03 2022-04-08 西安电子科技大学 Human body posture estimation method based on deformation convolution
CN111539349A (en) * 2020-04-27 2020-08-14 平安国际智慧城市科技股份有限公司 Training method and device of gesture recognition model, gesture recognition method and device thereof


Non-Patent Citations (1)

Title
Lightweight Design of Human Pose Estimation Networks; 高丙坤 et al.; 《实验室研究与探索》 (Research and Exploration in Laboratory); 2020-01-31; Vol. 39, No. 1; pp. 79-82 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant