CN113298047A - 3D form and posture estimation method and device based on space-time correlation image - Google Patents

3D form and posture estimation method and device based on space-time correlation image

Info

Publication number
CN113298047A
CN113298047A (application No. CN202110728994.9A; granted publication CN113298047B)
Authority
CN
China
Prior art keywords
image
estimation
key point
estimation result
discriminator
Prior art date
Legal status
Granted
Application number
CN202110728994.9A
Other languages
Chinese (zh)
Other versions
CN113298047B (en)
Inventor
王文东
孙逸典
张继威
田野
阙喜戎
龚向阳
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110728994.9A
Publication of CN113298047A
Application granted
Publication of CN113298047B
Status: Active

Classifications

    • G06V 40/103: Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F 18/2132: Feature extraction by transforming the feature space, based on discrimination criteria, e.g. discriminant analysis
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G06N 3/045: Combinations of networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method and a device for 3D form and posture estimation based on space-time correlated images. The method comprises the following steps: inputting a plurality of image frames having temporal or spatial correlation; extracting image features from the input image frames with an image feature extraction network to obtain the corresponding feature vectors; extracting temporal or spatial features from the feature vectors of the image frames with a spatio-temporal sequence feature extraction network combined with an attention mechanism, to obtain image feature vectors at different moments or positions; and inputting these image feature vectors into a regression model comprising a multilayer perceptron to obtain an estimation result for each moment or position, where each estimation result comprises the three-dimensional information of every key point at that moment or position. The method and device of the embodiments of the invention reduce both the error of the estimation result and its acceleration error, thereby reducing the jitter of the estimation result.

Description

3D form and posture estimation method and device based on space-time correlation image
Technical Field
The invention relates to the technical field of computer vision, and in particular to a method and a device for 3D form and posture estimation based on space-time correlated images, especially in occlusion scenes, which also addresses jitter in the estimation result.
Background
Form and posture estimation algorithms aim to recover the shape and pose of a human body or object from a video or a series of spatially correlated pictures. They have great application value in human-computer interaction and augmented reality, with promising uses in both entertainment and production. Existing form and posture estimation methods fall mainly into two types: 1) methods that use 2D key points as features and train a mapping network from 2D key points to 3D key points; 2) methods that take picture appearance feature vectors as input and directly regress 3D rotation information.
Chinese patent No. ZL202010717560.4 discloses a method for three-dimensional human body reconstruction under occlusion. Using a single-frame RGB-D image, it first applies instance segmentation to obtain pixel masks for the human body and the occluding object, and then uses these masks to segment the depth image. A convolutional neural network then estimates the pose of the occluding object and reconstructs it in three dimensions, after which the human body is reconstructed from the occluding object's 3D model, the color image and the human body depth image. This improves the accuracy and reliability of human pose estimation when the body is occluded by an object. However, the method is not suitable for self-occlusion of the human body, and existing methods usually suffer large estimation errors in self-occlusion scenes.
The Chinese patent application No. CN202010991889.X discloses a human body posture estimation method based on an hourglass network and an attention mechanism, which combines the hourglass network with image-level global and local attention to improve 2D key point detection accuracy. Its attention mechanism operates at the image level and its task is to estimate 2D key points, so it is not applicable to 3D human posture estimation.
A team from UC Berkeley and the Max Planck Institute proposed an end-to-end network model that recovers 3D human motion parameters from a single RGB image. The model first extracts image features through ResNet-50, then directly regresses the 3D human motion parameters through a regression network, and introduces a discriminator with a CNN structure to supervise the regression network. However, the adversarial training introduced in this prior art is unstable, and experimental data show it is insufficient. Intuitively: when the loss function of the discriminator reaches 0, the discriminator is optimal and can perfectly distinguish generated samples from real samples, so the generated sample distribution stops moving toward the real distribution for the remainder of training, leaving the generator weak.
How to estimate 3D form and posture from video or picture input in occlusion scenes, while also suppressing jitter in the estimation result, remains a problem to be solved.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for 3D form and posture estimation based on space-time correlated images, which take temporally or spatially correlated images as input and estimate the form and posture of the human body or object therein, so that the estimation error when the object is occluded is lower than that of existing methods.
According to one aspect of the invention, a 3D form and posture estimation method based on space-time correlated images is provided, comprising the following steps:
inputting a plurality of image frames containing a subject and having temporal or spatial correlation;
extracting image features from the input image frames with an image feature extraction network to obtain a feature vector for each image frame;
extracting temporal or spatial features from the feature vectors of the image frames with a spatio-temporal sequence feature extraction network combined with an attention mechanism, to obtain image feature vectors at different moments or positions;
inputting the image feature vectors at different moments or positions into a regression model comprising a multilayer perceptron to obtain an estimation result for each moment or position, where each estimation result comprises the three-dimensional information of every key point at that moment or position.
In some embodiments of the invention, the method further comprises: passing the three-dimensional information of each key point through a parameterized evaluation model to obtain its three-dimensional coordinates; feeding both these coordinates and the key-point coordinates derived from the ground-truth three-dimensional information in the data set into a discriminator, to obtain an estimation-result score and a real space-time sequence score respectively; and using the two scores to compute the loss function of the generator and the loss function of the discriminator respectively, then back-propagating to realize adversarial learning.
In some embodiments of the present invention, extracting spatio-temporal sequence features from the feature vectors of the image frames with a spatio-temporal sequence feature extraction network combined with an attention mechanism, to obtain image feature vectors at different moments or positions, comprises:
taking the hidden states at different moments or positions output by the spatio-temporal sequence feature extraction network as the input of an attention module; computing 3 different projections of the hidden states, each projection transforming the input hidden states through a learnable parameter matrix; computing the correlation between different moments or positions from the first and second projections, and deriving from it the weights of the hidden states at the different moments or positions; and weighting the third projection with these weights as the output of the attention module. Introducing the attention mechanism makes the extraction of temporal or spatial features more thorough, which reduces the acceleration error and hence the jitter of the estimation result.
In some embodiments of the invention, the method further comprises preprocessing the initial image frames before inputting the temporally or spatially correlated pictures or image frames, the preprocessing comprising: cutting the initially obtained video into frames to obtain a picture for each frame; down-sampling each frame to obtain first image frames; and detecting the 2D key points of each first image frame, then cropping the first continuous image frames to a fixed pixel size based on the positions of these 2D key points, to obtain image frames containing the subject.
In some embodiments of the present invention, the method further comprises post-processing the three-dimensional key-point information at each moment or position, the post-processing comprising: obtaining an estimation result for the 2D key points in the image frame; constraining the three-dimensional key-point information with the 2D key-point estimation result and preset prior conditions; and minimizing an objective function through numerical optimization.
In some embodiments of the invention, the method further comprises: before post-processing the three-dimensional key-point information at each moment or position, pre-smoothing it by filtering the estimation result; and/or after the post-processing, smoothing the three-dimensional key-point information again by filtering the estimation result.
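As an illustration of the smoothing step, the sketch below filters an estimated 3D key-point trajectory with a centred moving average. The patent does not name a specific filter, so the moving average, the window size, and the (T, K, 3) sequence layout are all assumptions of this example:

```python
import numpy as np

def smooth_keypoints(seq, window=5):
    """Moving-average filter over time for a (T, K, 3) key-point sequence.

    The filter type is not specified in the text; a centred moving
    average with reflect padding is used purely as an illustration.
    """
    pad = window // 2
    # Reflect-pad along the time axis so endpoints keep full support.
    padded = np.concatenate([seq[pad:0:-1], seq, seq[-2:-2 - pad:-1]], axis=0)
    kernel = np.ones(window) / window
    out = np.empty_like(seq)
    for k in range(seq.shape[1]):        # each key point
        for d in range(seq.shape[2]):    # each coordinate (x, y, z)
            out[:, k, d] = np.convolve(padded[:, k, d], kernel, mode="valid")
    return out
```

Smoothing the trajectory in this way reduces the frame-to-frame second difference, i.e. exactly the acceleration error the text uses as its jitter measure.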
In some embodiments of the invention, the outputs of the discriminator are scores for the estimation result and for the ground truth in the data set, and these scores supervise both the discriminator and the generator. A score measures how close the discriminator's input is to the ground-truth distribution: the higher the score, the closer the input is to the true distribution. The loss function of the discriminator is the sum of the distance between the score of the estimated value and the minimum possible score, and the distance between the score of the ground truth and the maximum possible score; the loss function that supervises the generator is the distance between the score of the estimated value and the maximum possible score.
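The score-based losses described above can be written down concretely. In the sketch below the scores are assumed to lie in [0, 1], so the minimum and maximum score values are 0 and 1, and "distance" is taken to be the squared difference; both choices are illustrative assumptions (giving LSGAN-style losses), not claims about the patent's exact formulation:

```python
import numpy as np

# Assumed score range [0, 1]; 0 and 1 stand in for the minimum and
# maximum score values referred to in the text.
S_MIN, S_MAX = 0.0, 1.0

def discriminator_loss(score_est, score_real):
    # Push estimated-sequence scores toward the minimum and
    # real-sequence scores toward the maximum (squared distance).
    return np.mean((score_est - S_MIN) ** 2) + np.mean((score_real - S_MAX) ** 2)

def generator_loss(score_est):
    # The generator is rewarded when its estimates score like real data.
    return np.mean((score_est - S_MAX) ** 2)
```

Back-propagating these two losses through the discriminator and the generator respectively realizes the adversarial learning described above.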
In some embodiments of the present invention, the objective function comprises a 2-norm term for the 2D key-point error, a Gaussian-mixture-model prior term for the motion parameters, and a pre-specified motion prior term. The numerical optimization uses the L-BFGS method to adjust the input motion-parameter estimates so that they better conform to the priors, yielding the post-processed key-point information of the evaluated body.
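A minimal sketch of such an objective function is given below. Only the three-term structure (2-norm 2D key-point error, GMM prior, motion prior) follows the text; the orthographic projection, the spherical GMM, the previous-frame motion prior, and the weight values are all illustrative assumptions:

```python
import numpy as np

def objective(joints3d, joints2d_obs, gmm_means, gmm_weights, prev_joints3d,
              w_2d=1.0, w_gmm=0.1, w_motion=0.1):
    """Illustrative post-processing objective (weights are assumptions).

    joints3d      : (K, 3) current 3D key-point estimate being optimised
    joints2d_obs  : (K, 2) detected 2D key points
    gmm_means     : (M, K, 3) means of a Gaussian-mixture pose prior
    gmm_weights   : (M,) mixture weights
    prev_joints3d : (K, 3) estimate at the previous time step
    """
    # 2-norm reprojection term, using an orthographic projection
    # (simply dropping the depth coordinate) purely for illustration.
    e_2d = np.sum((joints3d[:, :2] - joints2d_obs) ** 2)

    # Negative log-likelihood under a spherical GMM pose prior.
    d2 = np.sum((joints3d[None] - gmm_means) ** 2, axis=(1, 2))
    e_gmm = -np.log(np.sum(gmm_weights * np.exp(-0.5 * d2)) + 1e-12)

    # Motion prior: penalise large jumps between consecutive frames.
    e_motion = np.sum((joints3d - prev_joints3d) ** 2)

    return w_2d * e_2d + w_gmm * e_gmm + w_motion * e_motion
```

In the full method this objective would then be minimized with L-BFGS, e.g. via `scipy.optimize.minimize(..., method="L-BFGS-B")`.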
In another aspect of the present invention, there is also provided a 3D form and posture estimation apparatus based on space-time correlated images, comprising a processor and a memory storing computer instructions, wherein the processor is configured to execute the computer instructions stored in the memory, and the steps of the method described above are realized when the computer instructions are executed by the processor.
In another aspect of the present invention, a computer-readable storage medium is also provided, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the method as set forth above.
According to the 3D form and posture estimation method and device based on space-time correlated images of the present invention, temporally or spatially correlated images are taken as input to estimate the form and posture of the human body or object therein, so that the estimation error when the object is occluded is lower than that of existing methods.
Further, by smoothing the estimation result, both the estimation error and its acceleration error can be reduced, thereby reducing the jitter of the estimation result.
Furthermore, the invention adopts a spatio-temporal sequence feature extraction network combined with an attention mechanism and a discriminator based on a graph neural network, improving the effect of adversarial learning between the generator and the discriminator. The discriminator takes the estimation-result sequence and the ground-truth sequence as inputs and produces a score for each. The two scores are used to compute the loss functions of the discriminator and the generator respectively, which are then back-propagated to realize adversarial learning, thereby reducing the error of the estimation result.
Further, the invention provides an optimization-based post-processing method that addresses the large estimation errors occurring when the object is occluded. It incorporates motion prior knowledge, can automatically correct large errors in the estimation result, and can be combined with smoothing of the rotation parameters. Being a general post-processing method, it can optimize the estimation result of any three-dimensional human posture method and reduces the estimation error under occlusion.
The method is applicable to, but not limited to, human bodies; 3D posture estimation of animals can also adopt the method provided by the invention.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
Fig. 1 is a flow chart illustrating a 3D morphology and pose estimation method according to an embodiment of the invention.
FIG. 2 is a schematic overall flow chart of a 3D morphology and pose estimation method according to another embodiment of the present invention.
Fig. 3 is a schematic flow chart of a generator network model in the method according to the embodiment of the present invention.
Fig. 4 is a schematic diagram of the working principle of the discriminator based on the graph neural network in the embodiment of the present invention.
Fig. 5 is a schematic diagram of a post-processing flow for dealing with the problem of self-occlusion in the method according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Fig. 1 is a flowchart of a 3D form and posture estimation method based on image frames with temporal or spatial correlation according to an embodiment of the present invention; the method may be implemented by a pre-trained 3D form and posture estimation model. Fig. 2 is a schematic diagram of such a model. As shown in fig. 2, the model may include a generator network model (generator for short) and a discriminator network model (discriminator for short). The discriminator network structure extracts features from the generator's estimation result and can simultaneously capture the time sequence and the spatial topology. Based on the score of the estimation-result sequence produced by the generator and the score of the real sequence in the data set, the loss functions of the discriminator and of the generator are computed respectively, and back-propagation realizes adversarial learning, thereby reducing the error of the estimation result. In an embodiment of the present invention, the generator network model, shown in fig. 3, includes an appearance feature extraction network module, a spatio-temporal sequence feature extraction network module, and a regression module. The discriminator network model is a new discriminator network structure, shown in fig. 4, which extracts features from the estimation result of the generator network model (the first estimation result in fig. 2) and can simultaneously capture the time sequence and the spatial topology.
In addition, as shown in fig. 2, the present invention may further post-process the first estimation result with a post-processing module: taking the 2D key-point estimation result as input, constraining the 3D information of the human body key points with manually specified priors, and minimizing the objective function by numerical optimization to obtain the post-processed 3D key-point information as a second estimation result.
As shown in fig. 1, the 3D morphology and pose estimation method implemented by the 3D morphology and pose estimation model includes the following steps:
In step S110, a plurality of image frames containing a subject and having temporal or spatial correlation are input.
The subject may be a human body or an animal. The following description takes a human body as an example, but the present invention is not limited thereto.
The image frames containing the subject may be, for example, a temporally correlated video of a moving subject, or spatially correlated MRI medical images of a subject, but the present invention is not limited thereto.
The present invention first requires acquiring image frames, for example a video or any continuous image frames that can be captured. During the training phase of the 3D form and posture estimation model, the image frames may come from a pre-established training set.
In a preferred embodiment of the present invention, the input image frames are preprocessed: the initially obtained video may be cut into frames to obtain a picture for each frame, and the frames are then screened as required, e.g. uniformly down-sampled; the down-sampled continuous image frames serve as the input of the present invention.
Before an image frame is input into the feature extraction network, the picture must be resized and cropped to meet the input requirements of the network. Specifically, the 2D key points of each image frame can be detected by a 2D key point detection algorithm, after which the person is scaled and cropped to a fixed pixel size according to the 2D key point positions. Common 2D key point detection algorithms include OpenPose and AlphaPose, but the present invention is not limited thereto.
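The scale-and-crop step around the detected 2D key points might look like the following sketch. The 224-pixel output size, the margin factor, and the nearest-neighbour resize are illustrative assumptions; a real pipeline would use a proper image-resampling library:

```python
import numpy as np

def crop_around_keypoints(image, kpts2d, out_size=224, margin=1.2):
    """Scale and crop a frame around detected 2D key points.

    `out_size=224` and `margin=1.2` are illustrative choices; the text
    only states that frames are cropped to a fixed pixel size.
    image  : (H, W, C) uint8 array
    kpts2d : (K, 2) array of (x, y) detections (e.g. from OpenPose)
    """
    x_min, y_min = kpts2d.min(axis=0)
    x_max, y_max = kpts2d.max(axis=0)
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half = margin * max(x_max - x_min, y_max - y_min) / 2

    # Clip the square crop window to the image bounds.
    x0 = int(max(cx - half, 0)); x1 = int(min(cx + half, image.shape[1]))
    y0 = int(max(cy - half, 0)); y1 = int(min(cy + half, image.shape[0]))
    crop = image[y0:y1, x0:x1]

    # Nearest-neighbour resize to the fixed network input size.
    ys = np.linspace(0, crop.shape[0] - 1, out_size).astype(int)
    xs = np.linspace(0, crop.shape[1] - 1, out_size).astype(int)
    return crop[np.ix_(ys, xs)]
```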
The image preprocessing requirements depend on the feature extraction network, and the image frames can be preprocessed appropriately for the image feature extraction network used in the next step.
And step S120, carrying out image feature extraction on the input image frames by using an image feature extraction network to obtain feature vectors corresponding to the image frames.
After the image frames are preprocessed, the image feature extraction network module performs the image feature extraction step. As an example, the image feature extraction network (also called the appearance feature extraction network) used by this module is ResNet-50, but the invention is not limited thereto. The input picture has dimensions (H, W, C), where H, W and C are the height, width and number of channels of the image frame, and the output of the network is, for example, a 2048-dimensional feature vector.
And step S130, performing time sequence feature or space feature extraction on the feature vectors corresponding to the image frames by utilizing a space-time sequence feature extraction network in combination with an attention mechanism to obtain image feature vectors at different moments or different positions.
After the feature vectors of the image frames are obtained, the spatio-temporal sequence feature extraction module further extracts temporal features for temporally correlated image frames, and spatial features for spatially correlated image frames. In some embodiments of the present invention, this module uses a lightweight recurrent neural network combined with a self-attention mechanism as the spatio-temporal sequence feature extraction network to extract the temporal or spatial features of the feature vectors of consecutive image frames. As an example, the lightweight recurrent neural network may be a Simple Recurrent Unit (SRU) network, but the present invention is not limited thereto.
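For concreteness, a single-layer SRU forward pass can be sketched as below. The shapes and the tanh output nonlinearity follow one common SRU variant and are assumptions of this illustration, not the patent's exact network; the point is that all matrix multiplies depend only on the inputs, so only a cheap elementwise recurrence is sequential:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sru_forward(xs, W, Wf, Wr, vf, vr, bf, br):
    """One-layer Simple Recurrent Unit forward pass (illustrative variant).

    xs : (T, d) input sequence; W* : (d, d); v*, b* : (d,).
    All matrix multiplies depend only on x_t, so they can be batched
    over time; only the elementwise cell-state recurrence is sequential.
    """
    T, d = xs.shape
    # Time-parallel part: three projections of the whole input sequence.
    u, uf, ur = xs @ W.T, xs @ Wf.T, xs @ Wr.T
    c = np.zeros(d)
    hs = np.empty((T, d))
    for t in range(T):
        f = sigmoid(uf[t] + vf * c + bf)            # forget gate
        c = f * c + (1.0 - f) * u[t]                # lightweight cell update
        r = sigmoid(ur[t] + vr * c + br)            # reset gate
        hs[t] = r * np.tanh(c) + (1.0 - r) * xs[t]  # highway connection
    return hs
```

Compared with an LSTM, the gating here never multiplies the previous hidden state by a matrix, which is what keeps the parameter count low and the recurrence fast.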
The invention adopts a lightweight recurrent neural network because the gating of a traditional LSTM (long short-term memory network) has many parameters, so it needs more training data, overfits easily, and is harder to train; moreover, the LSTM gate-update mechanism depends on the cell state at the previous moment, so it parallelizes poorly and trains and infers slowly. The invention observes that the action of a person at a moment t of continuous image frames is correlated with the actions before and after t, and since the action at each moment is inferred from the hidden state, this correlation has a certain consistency. The attention mechanism is therefore combined to fully exploit the correlation between frames: the final hidden state at moment t is a weighted average of the hidden states at all moments in a time window, and the attention mechanism obtains the weights for different moments from the hidden states at those moments. Likewise, the attention mechanism can fully exploit the spatial correlation between frames, obtaining the weights of different positions from the hidden states at those positions. In the embodiment of the present invention, the attention mechanism is implemented as follows:
the method comprises the steps of taking hidden states of different moments/positions output by a time-space sequence feature extraction network as an attention mechanism module input, obtaining 3 different projection data representations by calculating projection of the hidden states, wherein the 3 projection data respectively transform the input hidden states through 3 learnable parameter matrixes. The attention mechanism module relates to 3 matrixes, namely a Q (Query) feature matrix, a K (Key) feature matrix and a V (Value) feature matrix, obtains a weight coefficient of the Value feature matrix corresponding to the Key feature matrix by calculating the similarity or correlation of the Query feature matrix and the Key feature matrix, and then performs weighted summation on each element of the Value feature matrix to obtain a final attention Value.
In the embodiment of the invention, two of the projections serve as the Query and Key feature matrices for computing the correlation, which can be computed in various ways; for example, the correlation (similarity) between the projections can be computed by the inner product:
Similarity(x, y) = ⟨x, y⟩
where x and y are the Query matrix and the Key matrix respectively.
After the correlations at the different moments are obtained, the similarities are further normalized so that the weights of the hidden states at the different moments sum to 1.
After the weights are obtained, the third projection data representation of the hidden states is weighted by them; the result serves as the output of the attention mechanism module and as the data representation from which the 3D human body key point information is finally regressed.
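As an illustration, the Query/Key/Value attention described above can be sketched in a minimal NumPy version. The window length T, dimension d and the projection matrices Wq, Wk, Wv below are illustrative stand-ins for the learnable parameters, not values from the patent:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(H, Wq, Wk, Wv):
    """Attend over the hidden states H of a time window.

    H          : (T, d) hidden states at the T moments of the window
    Wq, Wk, Wv : (d, d) learnable projection matrices
    Returns the (T, d) final hidden states (each a weighted average of
    the projected states of the whole window) and the weight matrix.
    """
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    sim = Q @ K.T                      # inner-product similarity
    weights = softmax(sim, axis=-1)    # normalize each row to sum to 1
    return weights @ V, weights

T, d = 8, 16
rng = np.random.default_rng(0)
H = rng.standard_normal((T, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, w = window_attention(H, Wq, Wk, Wv)
```

Each row of `out` is a weighted average of the projected hidden states of the whole window, with weights derived from the inner-product similarity and normalized to sum to 1, matching the description above.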
The lightweight spatio-temporal sequence feature extraction network combined with the attention mechanism can extract the temporal or spatial correlation among the input image frames. It makes the extraction of temporal/spatial features more thorough, thereby reducing the jitter of the estimation result; the numerical measure of jitter is the acceleration error of the estimation result, so reducing the acceleration error of the estimation result achieves the goal of reducing jitter. The lightweight spatio-temporal sequence feature extraction network of the embodiment of the invention can also be combined with other existing models and is applicable to various 3D pose estimation methods.
Step S140: inputting the picture feature vectors at different moments/positions into a regression model comprising a multilayer perceptron model to obtain the estimation result of each moment, where the estimation result of each moment comprises the three-dimensional information of each key point at that moment.
After the output of the attention mechanism module is obtained, the 3D human body key point information of each frame can be further estimated by a regression module. More specifically, the regression module may use a multilayer perceptron (MLP): a multilayer neural network learns the mapping from the final data representation of the hidden state at each moment/position to the human body parameters at that moment/position. For the multilayer perceptron model, the invention specifically adopts a 3-layer structure: two fully connected hidden layers of 1024 neurons each, followed by an output layer, where the output of each hidden layer passes through a dropout (random deactivation) layer to prevent overfitting and improve the generalization of the network.
After the spatio-temporal sequence feature extraction network, the final data representation is obtained; by inputting the data representation of each moment, the regression network of the regression module obtains the estimation result of that moment, i.e. the 3D information of each key point at that moment.
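A minimal PyTorch sketch of the regression module as described: two 1024-neuron fully connected hidden layers, each followed by dropout, then an output layer. The input dimension `feat_dim` and output dimension `out_dim` (here 24 joints × 3 rotation parameters) are illustrative assumptions, not values stated in the patent:

```python
import torch
import torch.nn as nn

class KeypointRegressor(nn.Module):
    """3-layer MLP: two 1024-unit hidden layers with dropout, then a
    linear output mapping each per-moment representation to the 3D
    key point parameters (feat_dim / out_dim are illustrative)."""
    def __init__(self, feat_dim=2048, out_dim=24 * 3, p_drop=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(1024, out_dim),
        )

    def forward(self, x):      # x: (T, feat_dim), one row per moment
        return self.net(x)     # (T, out_dim) per-moment 3D parameters

reg = KeypointRegressor().eval()
out = reg(torch.randn(8, 2048))
```

In `eval()` mode the dropout layers are inactive; during training they randomly deactivate hidden units as described above.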
In the training process of the 3D form and posture estimation model, the estimation result is further used as the input of a discriminator, and the generator network structure is optimized based on the discrimination result of the discriminator.
Before being input to the discriminator, the obtained three-dimensional key point information needs to be preprocessed: since what is estimated is the 3D rotation parameter values of the key points, the 3D coordinates of the key points can first be obtained through the parameterized human body model SMPL (Skinned Multi-Person Linear Model), and these 3D coordinates are then used as the input of the discriminator. The parameterized human body model SMPL provides a way of simulating the body surface of a posed human body; it can simulate the bulging and hollowing of human muscles during limb movement, which avoids surface distortion of the human body during motion and accurately describes the shape of the muscles as they stretch and contract.
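Running SMPL itself requires its learned model file, but the first step of this conversion — turning each per-joint 3D rotation parameter (an axis-angle vector) into a rotation matrix used by the model's kinematic chain — can be sketched self-containedly via Rodrigues' formula. This is an illustrative sketch, not the patent's implementation:

```python
import numpy as np

def rodrigues(axis_angle):
    """Convert one axis-angle vector (a per-joint 3D rotation
    parameter, as used by SMPL) to a 3x3 rotation matrix via
    Rodrigues' formula: R = I + sin(t)*K + (1 - cos(t))*K^2,
    where t is the rotation angle and K the cross-product matrix."""
    theta = np.linalg.norm(axis_angle)
    if theta < 1e-8:
        return np.eye(3)           # near-zero rotation
    k = axis_angle / theta         # unit rotation axis
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

# a rotation of pi/2 about the z-axis maps the x-axis onto the y-axis
R = rodrigues(np.array([0.0, 0.0, np.pi / 2]))
```

SMPL composes such per-joint rotations along the body's kinematic tree (together with shape parameters) to produce the 3D joint coordinates fed to the discriminator.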
The same operation also needs to be performed on the ground-truth three-dimensional key point information in the data set: it is converted into 3D coordinates and then used as the input of the discriminator. Fig. 4 is a schematic diagram of the working principle of the discriminator based on the graph neural network in the embodiment of the present invention. As shown in fig. 4, the 3D coordinate calculation module applies the parameterized human body model to the estimated and ground-truth 3D rotation parameters of the human body key points respectively, obtaining the estimated and ground-truth 3D coordinates of the key points.
As shown in fig. 4, the discriminator adopts a graph neural network structure and can simultaneously capture the temporal information of the spatio-temporal sequence and the topological structure information of the human body.
The estimated 3D coordinates of the human body key points are input to the discriminator as an estimated spatio-temporal sequence, yielding a score for the estimation result in the corresponding time window, i.e. the estimated-sequence score; meanwhile, a real spatio-temporal sequence of the same length (the ground-truth 3D key point coordinates) is also input to the discriminator, yielding a corresponding score for the real sequence. From these two scores, the values of the two loss functions can be calculated.
The discriminator outputs scores that evaluate both the estimation result and the ground truth in the data set, and these scores supervise the discriminator and the generator: a high score indicates that the discriminator input is close to the ground-truth distribution, and a low score indicates that it is far from it.
The loss function of the discriminator is the sum of the distance between the score of the estimated value and the minimum score and the distance between the score of the ground truth and the maximum score; the loss function that supervises the generator is the distance between the score of the estimated value and the maximum score.
As an example, the two loss functions are:

$$L_{gen} = \mathbb{E}_{\Theta_{fake}}\big[(D(\Theta_{fake}) - 1)^2\big]$$

$$L_{dis} = \mathbb{E}_{\Theta_{fake}}\big[D(\Theta_{fake})^2\big] + \mathbb{E}_{\Theta_{real}}\big[(D(\Theta_{real}) - 1)^2\big]$$

where Θ_fake denotes the rotation parameter sequence estimate output by the generator, D(Θ_fake) denotes the output of the discriminator when that estimate is its input, Θ_real denotes a ground-truth rotation parameter sequence in the data set, D(Θ_real) denotes the output of the discriminator when that ground-truth sequence is its input, and E[·] denotes the expectation over the discriminator output. These two loss functions may respectively form part of the discriminator loss L_dis and the generator loss L_gen. The generator loss expresses that the score the discriminator assigns to the generator's estimation result should be as high as possible (i.e. close to a real sample); the discriminator loss has two parts: real samples should receive a high score from the discriminator, while the results estimated by the generator should receive a score as low as possible, so that the discriminator learns to distinguish generated samples from real ones. Back-propagation is then performed on the computed discriminator loss and generator loss, realizing adversarial learning.
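The two adversarial loss functions described above — the estimate's score pushed toward the maximum by the generator, and toward the minimum (with the real sequence's score pushed toward the maximum) by the discriminator — can be sketched in least-squares (LSGAN-style) form. The score values below are arbitrary examples:

```python
import numpy as np

def adversarial_losses(score_fake, score_real):
    """Least-squares adversarial losses: the generator pulls the
    estimate's score toward the maximum (1); the discriminator pulls
    the estimate's score toward the minimum (0) and the real
    sequence's score toward the maximum (1)."""
    l_gen = np.mean((score_fake - 1.0) ** 2)
    l_dis = np.mean(score_fake ** 2) + np.mean((score_real - 1.0) ** 2)
    return l_gen, l_dis

score_fake = np.array([0.2, 0.4])   # discriminator scores of estimates
score_real = np.array([0.9, 0.8])   # discriminator scores of ground truth
l_gen, l_dis = adversarial_losses(score_fake, score_real)
```

Each loss would then be back-propagated through its own network, updating the generator and discriminator in alternation.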
The network structure of the discriminator in the embodiment of the invention differs greatly from that of a conventional discriminator in a generative adversarial network. In the prior art, the task of a generative adversarial network is picture generation, so the output of the generator and the input of the discriminator are RGB pictures; these differ from the human body posture sequences of the present task, i.e. the 3D rotations of the human joints at each moment, so the conventional approach cannot be applied directly. The discriminator of the embodiment of the invention takes the estimation result sequence (estimated spatio-temporal sequence) and the ground-truth spatio-temporal sequence as inputs, obtaining a score for each. From these two scores the discriminator loss and the generator loss are computed respectively, and back-propagation is then performed to realize adversarial learning, thereby reducing the error of the estimation result.
After the first estimation result is obtained and before post-processing is performed, the embodiment of the invention further preferably first applies a rotation parameter smoothing method as a pre-smoothing operation. The pre-smoothing takes the form of filtering the first estimation result; the filter may be a simple moving average, a low-pass filter or a Kalman filter, depending on the application's requirements on processing delay and method complexity. This smoothing of the estimated parameters smooths the estimation result and reduces its jitter; since the numerical measure of jitter is the acceleration error of the estimation result, the rotation parameter smoothing method of the invention reduces the acceleration error of the estimation result. That is, combined with the aforementioned lightweight spatio-temporal sequence feature extraction network, the smoothing can further reduce the acceleration error of the estimation result and improve the de-jittering effect.
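Of the filter choices mentioned, the simple moving average is the easiest to sketch. This is an illustrative NumPy version; the window length is an arbitrary assumption, and a low-pass or Kalman filter could be substituted depending on the latency budget:

```python
import numpy as np

def moving_average_smooth(params, window=5):
    """Pre-smoothing sketch: centered moving average over the
    per-frame rotation parameters.

    params : (T, D) rotation parameters, one row per frame
    Returns the (T, D) smoothed parameters (edge frames are
    attenuated by the zero padding of np.convolve).
    """
    kernel = np.ones(window) / window
    # filter each parameter dimension independently along time
    return np.stack(
        [np.convolve(params[:, d], kernel, mode="same")
         for d in range(params.shape[1])], axis=1)

smoothed = moving_average_smooth(np.ones((10, 3)))
```

A constant parameter track passes through unchanged in the window interior, while high-frequency jitter (and hence the acceleration error it causes) is averaged out.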
The output estimation value then needs to be post-processed once by the post-processing module. After processing, it is further judged whether the estimation result conforms to the prior; the judgment can be made with a prior threshold T, and if the estimation result is still greater than the prior threshold T, the post-processing continues.
The post-processing requires the 2D key point estimation result of the person as input; the 2D key points can be obtained through an existing deep learning model. The human body key point information can then be constrained with manually specified prior conditions. More specifically, an objective function may be minimized with a numerical optimization method, where the objective function may include a 2-norm term on the 2D key point error, a Gaussian mixture model prior term on the motion parameters, and a manually specified motion prior term. The numerical optimization may use the L-BFGS algorithm to adjust the input original motion parameter estimate (the first estimation result) so that it better conforms to the motion prior, yielding the post-processed human body key point information. After the post-processed key point information is obtained, it is again judged whether the processed result is smaller than the prior threshold T; if not, the last processed result is taken as input and the optimization is repeated until the result is smaller than T. The key point information obtained when the processing finishes is the second estimation result.
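The optimization step can be sketched with SciPy's L-BFGS implementation. The projection matrix `A`, the prior weight `lam` and the quadratic prior below are toy stand-ins: the patent's actual objective reprojects the SMPL 3D key points and uses a Gaussian-mixture pose prior, neither of which is reproduced here.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 6))     # hypothetical linear "projection"
kp2d = rng.standard_normal(10)       # observed 2D key point data (toy)
mean_pose = np.zeros(6)              # stand-in for a pose prior mean
lam = 0.1                            # prior weight (illustrative)

def objective(theta):
    reproj = np.sum((A @ theta - kp2d) ** 2)        # 2-norm key point term
    prior = lam * np.sum((theta - mean_pose) ** 2)  # simple pose prior term
    return reproj + prior

theta0 = rng.standard_normal(6)      # the first estimation result
res = minimize(objective, theta0, method="L-BFGS-B")
theta_refined = res.x                # post-processed parameters
```

In the full method, this minimization would be wrapped in the threshold loop described above: re-run with the latest result as the starting point until the objective falls below the prior threshold T.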
The post-processing operation shown in fig. 4 in the embodiment of the present invention can cope with self-occlusion of the body, and the error of the estimation result under object occlusion is reduced below that of existing methods. The method incorporates motion prior knowledge, can automatically correct large errors appearing in the estimation result, and can be combined with the rotation parameter smoothing method. The post-processing provided by the embodiment of the invention can optimize the estimation result of any three-dimensional human body posture method and reduce the estimation error under occlusion.
In some embodiments of the present invention, a smoothing operation is preferably further applied to the corrected estimation result. For this rotation parameter smoothing, the input is the corrected three-dimensional information and the output is the final, smoothed human body key point information. This smoothing may be kept consistent with the earlier pre-smoothing of the first estimation result, i.e. the estimation result is filtered, and the filter may be a simple moving average, a low-pass filter or a Kalman filter, depending on the application's requirements on processing delay and method complexity.
As described above, the 3D form and posture estimation method based on spatio-temporally correlated images of the embodiments of the present invention can reduce the error of the estimation result under object occlusion as well as the acceleration error of the estimation result, thereby reducing jitter. Existing 3D form and posture estimation methods provide no good guarantee on the jitter of the estimation result, and their results exhibit jitter of varying degrees, which seriously affects the visual impression. The present invention solves these problems of the prior art well.
The method is applicable to, but not limited to, human bodies; 3D posture estimation of animals can also adopt the method provided by the invention: a motion video or consecutive image frames of an animal are input, and the 3D information of the animal's key points is regressed from the feature vectors obtained by the image feature extraction network, thereby estimating the animal's 3D posture. Owing to the particularities of animal morphology, occlusion is relatively more severe than for humans, so for animal posture estimation it is all the more necessary to post-process the estimation result.
Corresponding to the above method, the invention also provides a 3D form and posture estimation device based on spatio-temporally correlated images, comprising a processor and a memory, wherein the memory stores computer instructions and the processor executes the computer instructions stored in the memory; when the computer instructions are executed by the processor, the device implements the steps of the aforementioned 3D form and posture estimation method.
Embodiments of the present invention further provide a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the steps of the aforementioned 3D form and posture estimation method. The computer-readable storage medium may be a tangible storage medium such as an optical disk, a USB flash drive, a floppy disk or a hard disk.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions or change the order between the steps after comprehending the spirit of the present invention.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A3D form and posture estimation method based on space-time correlation images is characterized by comprising the following steps:
inputting a number of image frames including a subject with temporal or spatial correlation;
carrying out image feature extraction on input image frames by using an image feature extraction network to obtain feature vectors corresponding to the image frames;
extracting time sequence characteristics or space characteristics of the characteristic vectors corresponding to the image frames by utilizing a space-time sequence characteristic extraction network in combination with an attention mechanism to obtain image characteristic vectors at different moments or different positions;
inputting the picture feature vectors at different moments or different positions into a regression model comprising a multilayer perceptron model to obtain estimation results of all the moments or positions, wherein the estimation results of all the moments or positions comprise three-dimensional information of each key point at all the moments or positions.
2. The method of claim 1, further comprising:
obtaining the three-dimensional coordinates of each key point from the three-dimensional information of each key point through a parameterized human body model, and respectively taking the obtained three-dimensional coordinates of each key point and the three-dimensional coordinates of each key point obtained from the ground-truth three-dimensional information in the data set as the input of a discriminator, respectively obtaining an estimation result score and a real spatio-temporal sequence score;
and respectively computing a loss function of a generator and a loss function of a discriminator from the obtained estimation result score and the real spatio-temporal sequence score, and performing back propagation to realize adversarial learning.
3. The method according to claim 1, wherein the performing spatio-temporal sequence feature extraction on the feature vectors corresponding to the image frames by using a spatio-temporal sequence feature extraction network in combination with an attention mechanism to obtain the picture feature vectors at different times or positions comprises:
taking hidden states at different moments or different positions output by the spatio-temporal sequence feature extraction network as the input of an attention mechanism module, and obtaining 3 different projection data representations by computing projections of the hidden states, the 3 different projection data representations being obtained by transforming the input hidden states through 3 learnable parameter matrices;
calculating the correlation of different moments or different positions by utilizing the first projection data representation and the second projection data representation, and calculating the weight of hidden states of different moments or different positions based on the correlation;
the third projection data representation is weighted with the calculated weights as an output of the attention mechanism module.
4. The method of claim 1, further comprising:
performing data preprocessing on an initial image frame before inputting a picture or an image frame with time or spatial correlation, wherein the data preprocessing comprises the following steps:
performing frame cutting operation on the initially obtained image frame to obtain a picture of each frame;
down-sampling each frame of picture to obtain a first image frame;
detecting the 2D key points of each first image frame, and cropping the obtained first image frames to a fixed pixel size based on the positions of the 2D key points of each first image frame to obtain image frames including the subject.
5. The method of claim 1, further comprising post-processing the three-dimensional information for each keypoint at each time or location, the post-processing comprising:
obtaining an estimation result of a 2D key point in an image frame;
and constraining the three-dimensional information of the key points by utilizing the 2D key point estimation result and a preset prior condition, and minimizing an objective function through numerical optimization.
6. The method of claim 5, further comprising:
before post-processing the three-dimensional information of each key point at each moment or position, pre-smoothing the three-dimensional information of each key point in a mode of filtering an estimation result; and/or
and after post-processing the three-dimensional information of each key point at each moment or position, smoothing the three-dimensional information of each key point by filtering the estimation result.
7. The method of claim 2,
the discriminator outputs scores that evaluate both the estimation result and the ground truth in the data set, and these scores supervise the discriminator and the generator, wherein a high score represents that the discriminator input is close to the ground-truth distribution, and a low score represents that it is far from it;
the loss function of the discriminator is the sum of the distance between the fraction of the estimated value and the minimum value of the fraction and the distance between the fraction of the real value and the maximum value of the fraction;
the loss function that provides supervision for the generator is then the distance between the estimate score and the maximum of the score.
8. The method of claim 5, wherein the objective function comprises a 2-norm term of the 2D keypoint error, a Gaussian mixture model prior term of motion parameters, and a pre-specified motion prior term;
the numerical optimization algorithm adopts the L-BFGS method to adjust the input original motion parameter estimation value so that it better conforms to the motion prior, obtaining the post-processed human body key point information.
9. An apparatus for 3D pose and pose estimation based on spatiotemporal correlated images, comprising a processor and a memory, characterized in that the memory has stored therein computer instructions for executing the computer instructions stored in the memory, the apparatus implementing the steps of the method according to any one of claims 1 to 8 when the computer instructions are executed by the processor.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202110728994.9A 2021-06-29 2021-06-29 3D form and posture estimation method and device based on space-time correlation image Active CN113298047B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110728994.9A CN113298047B (en) 2021-06-29 2021-06-29 3D form and posture estimation method and device based on space-time correlation image


Publications (2)

Publication Number Publication Date
CN113298047A true CN113298047A (en) 2021-08-24
CN113298047B CN113298047B (en) 2022-12-16

Family

ID=77330048

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110728994.9A Active CN113298047B (en) 2021-06-29 2021-06-29 3D form and posture estimation method and device based on space-time correlation image

Country Status (1)

Country Link
CN (1) CN113298047B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743053A (en) * 2022-04-14 2022-07-12 电子科技大学 Magnetic resonance image auxiliary processing system based on graph neural network and self attention
CN116664730A (en) * 2023-06-14 2023-08-29 北京百度网讯科技有限公司 Method and device for generating perception model, computer equipment and storage medium
CN117853695A (en) * 2024-03-07 2024-04-09 成都信息工程大学 3D perception image synthesis method and device based on local spatial self-attention

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
SHENG JIN ET AL.: "Differentiable Hierarchical Graph Grouping for Multi-person Pose Estimation", 《ECCV 2020》 *
WEI YANG ET AL.: "3D Human Pose Estimation in the Wild by Adversarial Learning", 《ARXIV.ORG》 *
XIAO CHU ET AL.: "Multi-Context Attention for Human Pose Estimation", 《ARXIV.ORG》 *
ZIHAO ZHANG ET AL.: "Weakly Supervised Adversarial Learning for 3D Human Pose Estimation from Point Clouds", 《IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS》 *
GAO HUI ET AL.: "Semi-automatic Annotation System for Human Body Images Based on Deep Learning", 《Journal of Beijing University of Posts and Telecommunications》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114743053A (en) * 2022-04-14 2022-07-12 电子科技大学 Magnetic resonance image auxiliary processing system based on graph neural network and self attention
CN114743053B (en) * 2022-04-14 2023-04-25 电子科技大学 Magnetic resonance image auxiliary processing system based on graph neural network and self-attention
CN116664730A (en) * 2023-06-14 2023-08-29 北京百度网讯科技有限公司 Method and device for generating perception model, computer equipment and storage medium
CN117853695A (en) * 2024-03-07 2024-04-09 成都信息工程大学 3D perception image synthesis method and device based on local spatial self-attention
CN117853695B (en) * 2024-03-07 2024-05-03 成都信息工程大学 3D perception image synthesis method and device based on local spatial self-attention

Also Published As

Publication number Publication date
CN113298047B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
CN113298047B (en) 3D form and posture estimation method and device based on space-time correlation image
CN111598998B (en) Three-dimensional virtual model reconstruction method, three-dimensional virtual model reconstruction device, computer equipment and storage medium
Gao et al. Dynamic hand gesture recognition based on 3D hand pose estimation for human–robot interaction
CN111652827B (en) Front face synthesis method and system based on generation countermeasure network
CN110570455B (en) Whole body three-dimensional posture tracking method for room VR
Jojic et al. Tracking self-occluding articulated objects in dense disparity maps
CN109685037B (en) Real-time action recognition method and device and electronic equipment
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
Nguyen et al. Static hand gesture recognition using artificial neural network
Saini et al. A review on particle swarm optimization algorithm and its variants to human motion tracking
CN112037310A (en) Game character action recognition generation method based on neural network
US20220262093A1 (en) Object detection method and system, and non-transitory computer-readable medium
Daubney et al. Tracking 3D human pose with large root node uncertainty
Chen et al. Markerless monocular motion capture using image features and physical constraints
CN114821764A (en) Gesture image recognition method and system based on KCF tracking detection
CN106778576A (en) A kind of action identification method based on SEHM feature graphic sequences
CN113989928A (en) Motion capturing and redirecting method
EP4176409A1 (en) Full skeletal 3d pose recovery from monocular camera
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
CN115205737B (en) Motion real-time counting method and system based on transducer model
CN115205750B (en) Motion real-time counting method and system based on deep learning model
CN114202794B (en) Fatigue detection method and device based on human face ppg signal
CN116129051A (en) Three-dimensional human body posture estimation method and system based on graph and attention interleaving
Liu et al. Adaptive recognition method for VR image of Wushu decomposition based on feature extraction
CN114548224A (en) 2D human body pose generation method and device for strong interaction human body motion

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant