CN116824631A

CN116824631A - Attitude estimation method and system

Info

Publication number: CN116824631A
Application number: CN202310702759.3A
Authority: CN
Inventors: 吴晓; 胡文莉; 李威; 乔建军; 何廷全; 胡东风
Original assignee: Southwest Jiaotong University
Current assignee: Southwest Jiaotong University
Priority date: 2023-06-14
Filing date: 2023-06-14
Publication date: 2023-09-29
Anticipated expiration: 2043-06-14
Also published as: CN116824631B

Abstract

The invention relates to the technical field of computer vision and discloses a posture estimation method and a posture estimation system. The invention solves the problems that the fine prediction of the attitude estimation is difficult to realize, the accuracy of the attitude estimation is influenced and the like in the prior art.

Description

Attitude estimation method and system

Technical Field

The invention relates to the technical field of computer vision, in particular to a posture estimation method and a posture estimation system.

Background

The attitude estimation is mainly applied to the field of real people, and the current technology has the following problems in the cartoon field:

(1) The human pose estimation method mostly depends on the distribution of training samples (source and target), and does not consider the possible occurrence of interference factors. Since the performance of the model is easily affected by the distribution bias of the data set, most of the pose estimation achieves good effect on the real person data set, but these methods are based on training with simple or realistic pictures. The cartoon data is often accompanied with the problem of uneven distribution, such as single color and simple lines of the simple drawing; the cartoon character texture clothes are complex; the graffiti characters are colorful, the background is difficult to distinguish, and the like, and the difference makes a training sample and a verification sample have great differences. The resulting non-correspondence of source and target is difficult to compensate by simple data augmentation or data supplementation. For example, in training, the picture type is cartoon animals, but the verification type is cartoon characters, and the strong difference between samples causes that the trained model is difficult to adapt to pictures of different styles. At the same time, the differences in feature distribution and expression caused by such differences further make feature extraction difficult.

In addition, cartoon datasets are often accompanied by various confusing information and interference factors, such as color mixing, dim light, inadequate limb shape, etc., due to their subjectivity. The confusing information of the picture may cause blurring of edges in the image, which may make it difficult for features such as corner points to be accurately detected, thereby affecting the accuracy of pose estimation. The cartoon data set is often not easy to predict due to the characteristics of exaggerated image, disordered texture, disordered background and the like besides the problems of shielding, different postures, different scales and the like of other data sets. When the pictures have multiple feet and few legs, and limbs, wings and the like are also included in the pictures, erroneous estimation is easy to cause. These problems must be addressed in a targeted design. An example of a specific problem is with reference to fig. 1.

(2) In the field of computer vision, image processing often faces a problem: a model is used to identify a plurality of objects of different dimensions. The multi-scale problem exists in almost all computer vision tasks, and the common real person pose estimation itself has the multi-scale problem, but the multi-scale problem of cartoon characters is more serious. The multi-scale problem of a real person is often caused by different scales of parts corresponding to different types of key points or cross pictures, for example, due to the problem of lens positions, a human body can appear in an image in different scales, sometimes only part of the body can appear, and sometimes the whole body appears when standing at a distance. Furthermore, different body parts may have different dimensions, e.g. the dimensions of a person's hand and face are typically small, while the dimensions of the body and legs are large, so that the pose estimation needs to be performed taking into account the different dimensions of the body parts. The problem of multiple scales of cartoon characters is also caused by similar reasons, for example, some pictures have very small and mini roles, and some pictures have larger subjects, and the scales of the two subjects are very different. Many simple character heads may be up to 80% and the limbs are only a short straight line, in which case the scale difference between the key points is very large.

In addition, different cartoon characters have serious problem of different scale ratios of the same part, such as the head scale of a sponge baby and small head dad are obviously different, and the limbs of cartoon characters are long, but small animals tend to have round, short and small limbs, and the like. And, the scales of different parts and the proportions of different cartoon characters are greatly different from each other, which further causes the scales of the cartoon characters to become extremely complicated, compared with the case where the persons have approximately the same figure scale and proportion. The cartoon data set has more obvious scale non-uniform performance due to the difference of data types, and besides the problem that the texture characteristics of a real person are lost, the cartoon character often does not have a common human body structure, so that the cartoon character does not have uniform scale of the real person data set on the human body structure. Unlike a general distribution and scale of a real person's head, extremities, and torso, the exaggerated elements of the cartoon dataset result in a wide variety of scales that may occur, so that the dimensional differences between the different roles are more pronounced. Specific dimensional issues refer to fig. 2.

(3) The existing attitude estimation method HRNet generates reliable high-resolution representation by repeatedly fusing representations generated through high-to-low sub-networks so as to improve performance, and the whole network structure focuses on the fusion of multi-scale features of the same level. In addition, the HRNet only performs scale fusion, lacks the refinement treatment of the features, and does not pay attention to some areas with complex scales. The network structure of HRNet refers to fig. 3.

Disclosure of Invention

In order to overcome the defects of the prior art, the invention provides a posture estimation method and a posture estimation system, which solve the problems that the fine prediction of posture estimation is difficult to realize, the accuracy of posture estimation is influenced and the like in the prior art.

The invention solves the problems by adopting the following technical scheme:

a posture estimation method carries out multi-level progressive feature fusion on a network model, constrains the network model through weight information of key points, and takes fusion features, distributed weights and losses of a plurality of stages as output of the network model.

As a preferred technical scheme, the method comprises the following steps:

s1, constructing a data set: constructing a data set for training and reasoning of the network model;

s2, feature fusion: and carrying out multi-level progressive feature fusion on the HRNet network model, and then training and reasoning the HRNet network model after fusion on a data set.

In step S2, the low-level information of the HRNet network model is supplemented by a feature fusion method of sequentially crossing layers and weakening layer by layer from the back to the front, so as to perform reverse feature fusion.

In the step S2, the HRNet network model is divided into N stages, the N stages are sequentially marked as stages 1 to stageN along the cartoon data transmission direction, and feature fusion of different scales is carried out only in the same stage; the final output characteristics of the stageN are used for carrying out reverse characteristic fusion to the previous stage; wherein N is more than or equal to 2 and N is an integer.

As a preferred technical solution, in step S2, the HRNet network model includes a polarization attention module; wherein the polarization attention module is used for focusing on key point information in input.

As a preferred technical solution, in step S2, the weight calculation formula of the channel branches of the polarization attention module is as follows:

wherein X represents an input feature, A ^ch (X) represents the weight of the channel branch, W _v (. Cndot.) represents the 1x1 convolution, σ, of the v branch ₁ (. Cndot.) represents the first Reshape operator, W _q (. Cndot.) represents the q branch 1x1 convolution, σ ₂ (. Cndot.) represents the second Reshape operator, W _z (. Cndot.) represents the convolution of 1x1 for the z-branch, F _LN (. Cndot.) represents LayerNorm operation, F _SM (. Cndot.) represents a softMax operation, x represents a matrix dot product operation, F _SG (. Cndot.) represents a Sigmoid operation;

the weight calculation formula of the spatial branches of the polarized attention module is as follows:

wherein A is ^sp (X) represents the weight of the spatial branch, F _GP (. Cndot.) represents global pooling operations, σ ₃ (·) represents the third Reshape operator.

As a preferred technical scheme, the method further comprises the following steps:

s3, constructing a memory structure: constructing a memory structure, and performing memory matching on the features fused at different stages in the memory structure; the memory structure is used for supplementing priori knowledge, is a readable storage space, can store partial characteristics or information in a targeted manner, can perform experience replay in the training process of the HRNet network model, and can obtain the distributed weight of MSE loss of the HRNet network model through stored memory.

As a preferred technical solution, step S3 includes the following steps:

s31, selecting a storage unit capable of reading and writing information as a memory structure, wherein the format of a memory slot of the storage unit is defined as follows: (Prediction, target), …, (Prediction, target); the Prediction represents the gesture estimation Prediction thermodynamic diagrams of different types of roles, and the Target represents the gesture estimation annotation true value thermodynamic diagrams of different types of roles;

s32, taking an attitude estimation prediction thermodynamic diagram of the HRNet network model as a query condition of a memory structure, and calculating the similarity of memory and query terms according to Euclidean distance, wherein a similarity calculation formula is as follows:

Distance＝Q ² -2×Q×M ^T +M ² ；

wherein Distance represents similarity, Q represents query term, M represents memory feature tensor stored in memory structure, and T represents transposition operation;

s33, obtaining a predictive thermodynamic diagram and a labeling true thermodynamic diagram of the front K bits with high-low similarity, and enabling the HRNet network model to take out K pairs of predictive values and true values from the memory structure through similarity calculation; wherein K is more than or equal to 2 and K is an integer;

s34, calculating MSE loss of K pairs of memories, obtaining distribution loss of prediction conditions of key points of similar memories, and processing the loss into weights through normalization; the K pairs of memory finger similarity are the predictive thermodynamic diagrams of the first K bits from high to low and the true value thermodynamic diagrams are marked.

An attitude estimation system for implementing the attitude estimation method comprises the following modules:

the data set construction module: to construct a dataset for training and reasoning of the network model;

and a feature fusion module: and the method is used for carrying out multi-level progressive feature fusion on the HRNet network model, and then training and reasoning the fused HRNet network model on a data set.

As a preferred technical scheme, the device further comprises the following modules:

the memory structure building module: the method comprises the steps of constructing a memory structure, and performing memory matching on the features fused at different stages in the memory structure; the memory structure is used for supplementing priori knowledge, is a readable storage space, can store partial characteristics or information in a targeted manner, can perform experience replay in the training process of the HRNet network model, and can obtain the distributed weight of MSE loss of the HRNet network model through stored memory.

Compared with the prior art, the invention has the following beneficial effects:

(1) According to the invention, a multi-level progressive feature fusion memory network is realized, features at different stages are reversely fused and progressive layer by layer on the basis of a memory structure, and the fused features at different stages are subjected to memory matching in the memory structure so as to obtain more accurate key point weight information, thereby helping the fine prediction of attitude estimation;

(2) According to the invention, the cross-level fusion characteristics are sent into a memory mechanism to obtain weight information, the fusion information of a plurality of stages is used as the output of a gesture estimation network structure, and the constraint and adjustment of the whole network are carried out through MSE loss and the weight information in the memory structure; the information which is completely fused in a plurality of stages is subjected to the writing operation of the memory structure, so that the information in the memory structure is ensured to be continuously updated; therefore, shallower spatial information is obtained through cross-level feature fusion, meanwhile deep semantic information is reserved, and memory is read and written by assistance of a memory structure, so that training difficulty caused by the scale problem is further solved.

Drawings

FIG. 1 is a schematic diagram of a problem with conventional cartoon character pose estimation;

FIG. 2 is a schematic diagram of a multi-scale problem in pose estimation;

FIG. 3 is a network structure diagram of HRNet;

FIG. 4 is a graph of the number of cartoon characters of each type;

FIG. 5 is a schematic diagram of various types of pictures of a dataset;

FIG. 6 is a network structure diagram of a memory structure;

FIG. 7 is a diagram of a multi-level progressive feature fusion and refinement network;

FIG. 8 is a diagram of a polarized self-attention network architecture;

FIG. 9 is a diagram of a multi-level progressive feature volume and refinement detailed network architecture;

FIG. 10 is a graph comparing results of HRNet model based on memory structure and HRNet model reasoning;

FIG. 11 is a graph of a model of multi-level progressive feature fusion improvement versus an HRNet model inference comparison;

FIG. 12 is a graph comparing model reasoning results of a model and an HRNet after combining a memory structure and multi-level progressive feature fusion;

FIG. 13 is a diagram of a multi-level progressive feature fusion memory network.

Detailed Description

The present invention will be described in further detail with reference to examples and drawings, but embodiments of the present invention are not limited thereto.

Example 1

As shown in fig. 1 to 13, due to the current lack of a data set for the cartoon domain, a cartoon data set carton-rolls is first constructed, which contains 5 different categories of simple figures, two-dimensional cartoon characters, graffiti drawings and the like, and each subject corresponds to 19 skeleton point (key point) labels and target detection frame labels, and total data is 5873 pieces of data. The dataset is used for training and reasoning of the subsequent model. The distribution of the data sets is shown in fig. 4.

5873 pieces of data of the dataset were divided into training and validation sets at a ratio of approximately 9:4. The training set has 4068 pictures, including 4129 subject detection frames. The validation set 1805 pieces of data include 1822 detection subjects. The dataset is mainly a single person or object, and multiple targets exist in a small number of pictures. In order to perform sample equalization on the experimental data set, the data set is subjected to equalization division, and the distribution condition of the training set and the verification set is ensured to be basically consistent. Specific examples of various types of pictures of a dataset refer to fig. 5. The distribution of various pictures in the training set and the verification set is shown in table 1.

Table 1 distribution table of various pictures in training set and verification set

For the labeling mode of 17 key points, which is different from COCO labeling, two bone points of the top of the head and the chin are added, 19 key points of the nose, the left eye, the right eye, the left ear, the right ear, the left shoulder, the right shoulder, the left elbow, the right elbow, the left wrist, the right wrist, the left hip, the right hip, the left knee, the right knee, the left ankle, the right ankle, the top of the head and the chin are accumulated, and meanwhile, the whole detection frame of the target is labeled. The left-right standard is based on the left-right of the character itself.

Aiming at the problems that cartoon data samples are unevenly distributed, the complexity of the cartoon samples is far greater than that of real person data, shielding exists, detection is difficult due to light and the like, priori knowledge is supplemented through a memory structure. The memory structure is used as a readable storage space, and can store part of characteristics or information in a targeted manner, so that a model can conveniently utilize the stored information. In the model training process, experience replay is carried out through a memory module, and the distribution weight of the key points is obtained through stored related difficult memories, so that the model applies more attention to the key points which are difficult to predict, and the problems of difficult prediction caused by the problems of shielding, complex gesture, different character styles and the like are solved. For example, the knee shielding problem can be solved by applying more attention to the knee key points caused by the posture of kneeling and the like.

The description about "key points" is as follows:

COCO keypoint track it is one of authoritative public games for human body key point detection, and the COCO data set respectively uses 17 human body key point atmospheres, namely a nose, a left eye, a right eye, a left ear, a right ear, a left shoulder, a right shoulder, a left elbow, a right elbow, a left wrist, a right wrist, a left hip, a right hip, a left knee, a right knee, a left ankle and a right ankle.

Aiming at common multi-scale problems and the problem of neglecting shallow information by HRNet feature fusion, a network structure of multi-level progressive feature fusion is designed. Often the resolution of lower level features is higher, focusing on details and textures, containing rich information, but because it passes through fewer convolutional layers, semantically is worse, noise is more, while higher level features focus on semantic information. Therefore, the combination of various different layers of features is beneficial to the supplement and perfection of different features, and the precision of the model is improved. The method is characterized in that low-level information is supplemented by means of cross-layer feature fusion in sequence, high-level features are taken as a main component, shallow-layer information is taken as an auxiliary component, feature fusion is sequentially carried out in a layer-by-layer weakening mode from back to front (after features of different levels in the same stage are fused from back to front, the features of the stage are used as the features of the stage, and the features of the stage are continuously fused with the features of the previous stage. Meanwhile, the network is expected to be capable of seeing the whole world in the feature fusion process, but focus on key information, and attention mechanisms and multi-scale feature fusion are combined, so that feature propagation of the depth network is further enhanced. The polarization self-attention PSA is introduced, attention (attention means that the model applies more attention to key information in the learning process, the learning effect is better.) key information in input is focused, the expression capability of the model in the process of learning feature representation is enhanced, and feature refinement processing is completed.

The two solutions are combined to realize a multi-level progressive feature fusion memory network. And carrying out reverse fusion and progressive layer by layer on the features at different stages on the basis of the memory structure, and carrying out memory matching on the fused features at different stages in the memory structure to obtain more accurate key point weight information so as to help the refined prediction of the attitude estimation.

The main technical features used here include two: memory structure and feature fusion.

A memory structure: the memory structure is used as a storage unit, and can read and write information. The memory structure is mainly used for storing, and the format of the memory slot is defined as (Prediction, target), …, (Prediction, target). Wherein the Prediction represents the gesture estimation Prediction thermodynamic diagram of different types of roles and the labeling true value corresponding to the Target. When carrying out the gesture estimation prediction of a picture, taking the HRNet gesture estimation prediction thermodynamic diagram as a memory structure query condition Q, and firstly calculating the similarity of a memory and a query term according to Euclidean distance in a memory structure, wherein the similarity calculation formula is as follows:

Distance＝Q ² -2×Q×M ^T +M ² ；

wherein Q is a query term, namely a feature tensor output by the pose estimation HRNet, M is a memory feature tensor stored in a memory structure, and T is a transposition operation.

Obtaining the predicted thermodynamic diagram and the true thermodynamic diagram of the front K bits with higher similarity, and taking out meaningful memories from the memory module by the network through similarity calculation, namely taking out K pairs of predicted values and true values. And then calculating MSE loss by using K groups of memories to obtain distribution information of prediction conditions of each key point of the similar memories. The original loss weight of the attitude estimation is obtained through the correlation memory, so that the attitude estimation is focused on key points which are difficult to predict by the role. Finally, in order to construct a meaningful memory module, the chapter measures the difficulty level of the picture through the loss of the attitude estimation, so as to determine whether the memory needs to be stored. The information stored in the memory structure is continuously updated in a circular queue mode so as to keep the size and the size of the whole memory unchanged, and the redundancy of old information is avoided. Each group of predictors and targets occupies one unit in the memory structure, respectively. The specific network structure diagram is shown in fig. 1, wherein the join Loss is the MSE Loss estimated by the original HRNet gesture, the Memory Loss is the key point Loss calculated by similar difficult Memory, and Batch Size Loss List is the MSE Loss array of the current training data, and is used for screening the difficult samples for Memory storage. The memory is always stored and read in the form of (Prediction, target), and the specific memory network structure is shown in fig. 6.

The meaning of the english word in fig. 6 is as follows:

joint Loss: each keypoint MSE loss, batch Size Loss List: MSE penalty array for one Batch Size, difficult Data: difficult samples, write: writing (writing the content into the memory structure), prediction: predicting thermodynamic diagram, target: labeling true values, memory Weight: memory weights, memory Loss: memory MSE Loss, loss: total loss, topK: taking the ID of the previous K-bit Distance, the Query By Id: inquiring the memory according to the ID, read: read (read memory structure), effective id Range: valid id range, writed: writing a position index (an index of a position where writing starts when writing a memory), add: a location index is added.

Feature fusion: HRNet maintains high resolution on one branch all the time, and a sub-network with low resolution is generated in each stage in a resolving way, so that HRNet is divided into 4 stages, stage1, stage2, stage3 and stage4. Because the 4 stages of HRNet only carry out feature fusion of different scales in the same stage, namely, fusion of a plurality of subnet features in the same stage is carried out, in order to acquire more space information, feature fusion is carried out continuously to the previous stage by taking the final output feature of stage4 as the main feature in a reverse fusion mode. The specific multi-level progressive feature fusion and refinement network structure is shown in fig. 7. The multi-level progressive feature fusion method in the experiment has the following detailed flow.

(1) Combining the 1×1 convolutions, BN, reLU to form a structure Conv1; the 3 x 3 convolutions, BN, reLU and PSA attention modules were assembled into a new structure Conv2, wherein the PSA architecture diagram is referenced in fig. 8.

The weight calculation formula of the channel branches of the PSA attention module is as follows:

wherein W is _v (·)、W _q (X) and W _z|θ1 (. Cndot.) is a 1X1 convolutional layer, σ ₁ (. Cndot.) and sigma ₂ (. Cndot.) is the Reshape operator, F _SM (. Cndot.) is a softMax operation, x is a matrix dot product operation, ch is a channel-level multiplication operator, W _z (. Cndot.) represents the convolution of 1x1 for the z-branch, F _LN (. Cndot.) represents LayerNorm (layer normalization) operation, F _SG (. Cndot.) represents a Sigmoid operation, F _SG (. Cndot.) all parameters were kept within the (0, 1) range.

The weight calculation formula of the spatial branches of the PSA attention module is as follows:

wherein W is _v (·)、W _q (X) is a standard 1X1 convolutional layer, σ ₁ (·)、σ ₂ (. Cndot.) and sigma ₃ (. Cndot.) is three Reshape operators, F _SM (. Cndot.) is the SoftMax operator, F _GP (. Cndot.) is a global pooling, sp is a spatial multiplication operator. The PAS (polarization self-attention module) used here is originally used for a residual module, but is mainly used for feature refinement processing in cross-layer fusion in consideration of the problem of model calculation complexity.

(2) The output of stage1 channel 256 goes through Conv1 down channel 8.

(3) The two different resolution outputs of stage2 are reduced to 1/2 of the original channel, i.e. the high resolution subnet characteristic of 48 of the original channel and the low resolution subnet characteristic of 96 of the channel number are reduced to 24 and 48 respectively. And unifying the two different feature scales for feature fusion. The feature of 72 channels after combination is processed into 16 channels by Conv1 as the output of stage 2.

(4) The three outputs of stage3 are respectively processed into original 1/3 channel transmission through Conv1, namely, the 48, 96 and 192 channel numbers of the three sub-network features are respectively compressed into 16, 32 and 64. The three output features are fused, and the feature that the output of the third stage is 112 channels is processed into 32 channels as the output of stage3 through Conv 1.

(5) Because stage4 of HRNet is already processed as one Output through feature fusion, the Output of stage4 after feature fusion is guaranteed to be the number of key points through a convolution layer, the thermodynamic diagram at the moment is taken as Output1, and the corresponding loss1 is obtained according to MSE.

(6) By means of the Conv1 and Conv2 fusion of the characteristics of the Output and stage3 of HRNet with stage4 channels of 48, the number 19 of the last Output channels as key points, namely Output2, is always kept through a convolution layer. Loss of loss2 through MSE.

(7) And by analogy, the features after the current fusion and the fusion features of the previous stage are subjected to feature fusion again, and the fusion features ensure that the final Output channels are the number of key points through a convolution layer, so that Output3 and Output4 and corresponding losses loss3 and loss4 are obtained.

(8) Output1, output2, output3 and Output4 are respectively used as query conditions, and corresponding memories are found in the memory structure to obtain weights corresponding to the four losses.

(9) The loss of 4 times fusion is added as the overall loss, and the whole network is constrained. The loss of the entire structure is calculated. The overall loss calculation formula is as follows:

wherein loss is _i Loss value, weight, calculated through MSE for Output obtained by corresponding to ith feature fusion _i Is a memory knotAnd the Loss is a Loss value of cross-layer fusion (Loss of an integral model of a multi-level progressive feature fusion memory network).

(10) Taking MSE loss of Output4 as a storage basis of a memory structure, and memorizing and storing the prediction thermodynamic diagram with larger loss and the standard value.

The detailed feature processing and feature fusion modules of the 4 stages are shown in fig. 9.

Memory structure + feature fusion: a multi-level progressive feature fusion memory network scheme is provided, which is realized by an HRNet, a memory module and a multi-level progressive feature fusion module. Firstly, a multi-level progressive feature fusion method is added in a network structure, and a self-attention mechanism is introduced when feature fusion is carried out in adjacent stages. Based on the original network structure, each stage is designed to perform feature fusion, namely, the output of each stage is compressed through a convolution layer in equal proportion, a plurality of features of each stage are fused, the fused features are fused with features of the previous stage in sequence from stage4, feature learning is performed through a 3×3 convolution, BN, reLU and PSA attention, channel compression is performed through a 1×1 convolution, and the fused features are output as Heatm of 19 skeleton points through the 1×1 convolution. Thus, the fusion characteristics of different stages, namely the Output characteristics (Output 1) of the satge4, the characteristics (Output 2) after the fusion of the stage4 and the stage3, the fusion characteristics (Output 3) of the stage4, the stage3 and the stage2, and the characteristics (Output 4) after the reverse fusion of the four stages, are obtained.

And then, on the basis of multi-level progressive feature fusion, combining the memory structure to perform multi-level memory structure read-write operation. And taking four different fusion characteristics as query conditions, performing reading operation in a memory structure, and finding similar difficult samples to perform weight calculation. And judging the training difficulty degree of the current data by taking the total characteristic (Output 4) loss value after the reverse fusion of the four stages as a basis, thereby storing the memory module. And 4 fusion features and true values are used for calculating MSE loss of the key points, so that 4 losses are obtained. And simultaneously, taking the four output characteristics as query conditions, finding out similar memories in a memory structure, and respectively weighting four losses through the weights obtained by the memories, so that the losses of the whole attitude estimation are obtained to restrict the whole attitude estimation.

In general, the cross-level fusion features are sent to a memory mechanism to obtain weight information, and the four-stage fusion information is used as the output of the attitude estimation network structure, and the constraint and adjustment of the whole network are carried out through MSE loss and the weight information in the memory structure. And (3) completely fusing the four stages of information to perform writing operation of the memory structure, and ensuring that the information in the memory structure is continuously updated. Therefore, shallower spatial information is obtained through cross-level feature fusion, meanwhile deep semantic information is reserved, and memory is read and written by assistance of a memory structure, so that training difficulty caused by the scale problem is further solved. The model reasoning comparison result of the whole network is shown in fig. 12, and the whole network structure diagram is shown in fig. 13.

Example 2

As further optimization of embodiment 1, as shown in fig. 1 to 13, this embodiment further includes the following technical features on the basis of embodiment 1:

the main reason why the HRNet is mainly used as the basic network structure of the posture estimation is that in the common Top-Down posture estimation method, the HRNet shows a certain advantage.

(1) By means of a Memory Structure (MS), the estimated attitude of HRNet AP is improved by 1.9%. The problems of shielding, false detection, missing detection and the like are improved to a certain extent. The specific experimental data are shown in table 2.

TABLE 2 cartoon pose estimation experiment result table based on memory structure

The three parameters of the number of the memory structures which can be read through one query term, the number of the pictures written into the memory structures in each batch and the normalized range when the weight is calculated are respectively tested, and the data are shown in tables 3 to 5.

Table 3 memory structure write parameters vs. experimental results table

Table 4 memory structure normalized range parameter comparison experiment results table

Table 5 memory structure read parameter versus test results table

The introduction of the memory structure improves the attitude estimation to a certain extent, but the improvement effect is different along with the change of the parameters, so that the performance is obviously improved under the condition that the parameters are relatively proper, and the accuracy of 1.9 points can be improved. The rationality and effectiveness of the memory structure are further proved by experimental data, and the problems of the difference and the difficulty of the cartoon data set are solved to a certain extent. Model reasoning effects for specific classes of cartoon characters refer to figure 10.

(2) The characteristics of the four stages of HRNet are subjected to characteristic fusion in a cross-layer mode, a self-attention module is introduced during cross-layer fusion, the precision is also greatly improved, specific experimental data are shown in a table 6, and a multi-level progressive characteristic fusion network (Muti-LPNet) can be respectively improved by 1.8% and 2.3% without adding the self-attention module and without adding the attention module. Model reasoning effects for specific classes of cartoon characters refer to figure 10.

Table 6 Multi-level progressive feature fusion network experimental result table

(3) The memory module, the multi-level progressive characteristic fusion and the multi-level progressive characteristic fusion memory network structure of two points of the refined network are combined, original HRNet experimental data are compared with the original HRNet experimental data as shown in table 7, and the overall accuracy of the HRNet is improved by 3.1%. Model reasoning effects for specific classes of cartoon characters refer to figure 11.

Table 7 Multi-level progressive characteristic fusion memory network structure experimental result table

The experimental software and hardware environments are shown in table 8.

Table 8 experiment software and hardware environment table

The human body posture estimation algorithm adopted in the experiment is an HRNet model, and is different from the labeling mode of 17 commonly used COCO key points in the data labeling process, two bone points of the top of the head and the chin are added on the basis, and the total number of the bone points is 19, so that the number of the bone points is set to 19, and the number of estimated half body key points is 8. The experimental epoch was set to 210, the test and validation stages Batch Size were set to 16 for memory considerations, the model training was performed simultaneously on both cards using HRNet-W48 (256 x 192) using the pre-training model, the initial learning rate was set to 1e-3 using Adam optimizer, and at 170 and 200epoch down to 1e-4 and 1e-5, respectively, with the remaining parameters remaining consistent with the HRNet official open source parameters.

For the improved memory structure, the main parameters are the storage size N of the memory structure, the number K of similar memory extractions, the number S of memory structures written each time and the normalization range for performing normalization weight calculation after reading the memory. Here, considering the upper limit of the memory, the memory size of the memory structure is set to 2000, that is, the memory is overwritten every time 2000 pairs of memories are stored. The number of the read similar memories is respectively tested at 8, 16, 24 and 32, and since the Batch Size is 16, the range of each writing cannot exceed the total number of the current training pictures, and therefore the number of the pictures written in each Batch of the memory structure is 3, 6, 12 and 16, and when the memory structure is normalized to be weighted, the normalization ranges are respectively 0.9 to 1,0.7 to 1 and 0.5 to 1.

For the experiment of the whole network structure, the normalization range is 0.5-1, the read memory size is 8, the write memory structure size is 12, and the memory module storage space size is 2000. The channels of the fused features from stage1 to satge4 in the process of reverse fusion are 8, 16, 32 and 48 respectively. The two scale feature downscaling channels of stage2 are originally 1/2, and the three scale feature channel features of stage3 are compressed to originally 1/3. And the number of channels is kept consistent with the number of characteristic channels of stage4 in each forward fusion process, and 256×192 of experimental pictures in this chapter are always kept, so that the number of characteristic channels after each cross-level characteristic fusion is 48.

As described above, the present invention can be preferably implemented.

All of the features disclosed in all of the embodiments of this specification, or all of the steps in any method or process disclosed implicitly, except for the mutually exclusive features and/or steps, may be combined and/or expanded and substituted in any way.

The foregoing description of the preferred embodiment of the invention is not intended to limit the invention in any way, but rather to cover all modifications, equivalents, improvements and alternatives falling within the spirit and principles of the invention.

Claims

1. A posture estimation method is characterized in that multi-level progressive feature fusion is carried out on a network model, the network model is constrained through weight information of key points, and fusion features, distributed weights and losses of multiple stages are used as output of the network model.

2. A method of estimating a pose according to claim 1, comprising the steps of:

3. The method according to claim 2, wherein in step S2, the inverse feature fusion is performed by supplementing the low-level information of the HRNet network model in a feature fusion manner of sequentially crossing layers and weakening layer by layer from the back to the front.

4. The gesture estimation method according to claim 3, wherein in step S2, the HRNet network model is divided into N stages, the N stages are sequentially recorded as stage1 to stage N along the cartoon data transmission direction, and feature fusion of different scales is performed only in the same stage; the final output characteristics of the stageN are used for carrying out reverse characteristic fusion to the previous stage; wherein N is more than or equal to 2 and N is an integer.

5. The method according to claim 4, wherein in step S2, the HRNet network model includes a polarization attention module; wherein the polarization attention module is used for focusing on key point information in input.

6. The method according to claim 5, wherein in step S2, a weight calculation formula of a channel branch of the polarized attention module is as follows:

A ^ch (X)＝F _SG (F _LN (W _z ((σ ₁ (W _v (X)))×F _SM (σ ₂ (W _q (X))))))；

wherein X represents an input feature, A ^ch (X) represents the weight of the channel branch, W _v (. Cndot.) represents the 1x1 convolution, σ, of the v branch ₁ (. Cndot.) represents the first Reshape operator, W _q (. Cndot.) represents the q branch 1x1 convolution, σ ₂ (. Cndot.) represents the second Reshape operator, θ ₁ Represents LayerNorm operation, W _z (. Cndot.) represents the convolution of 1x1 for the z-branch, F _LN (. Cndot.) represents LayerNorm operation, F _SM (. Cndot.) represents a softMax operation, x represents a matrix dot product operation, F _SG (. Cndot.) represents a Sigmoid operation;

A ^sp (X)＝F _SG [σ ₃ (F _SM (σ ₁ (F _GP (W _q (X))))×σ ₂ (W _v (X)))]；

7. A method of estimating a pose according to any of claims 2 to 6, further comprising the steps of:

8. The attitude estimation method according to claim 7, wherein the step S3 includes the steps of:

Distance＝Q ² -2×Q×M ^T +M ² ；

9. A pose estimation system, characterized by being adapted to implement a pose estimation method according to any of claims 1 to 8, comprising the following modules:

10. The attitude estimation system of claim 9, further comprising the following modules: