CN111539981B - Motion prediction system based on artificial intelligence - Google Patents

Motion prediction system based on artificial intelligence

Info

Publication number
CN111539981B
CN111539981B (application CN202010286652.1A)
Authority
CN
China
Prior art keywords
skeleton
prediction
module
picture
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010286652.1A
Other languages
Chinese (zh)
Other versions
CN111539981A (en)
Inventor
王田 (Wang Tian)
李泽贤 (Li Zexian)
刘洲阳 (Liu Zhouyang)
单光存 (Shan Guangcun)
吴淮宁 (Wu Huaining)
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202010286652.1A priority Critical patent/CN111539981B/en
Publication of CN111539981A publication Critical patent/CN111539981A/en
Application granted granted Critical
Publication of CN111539981B publication Critical patent/CN111539981B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30008 Bone

Abstract

The invention discloses an artificial-intelligence-based motion prediction system. A skeleton extraction module extracts skeleton information from a video, a skeleton prediction module comprising a recurrent neural network predicts the motion state of the skeleton, and a predicted picture is finally obtained by combining the predicted skeleton map with the original picture, so that a reasonable motion prediction result is obtained quickly and accurately.

Description

Motion prediction system based on artificial intelligence
Technical Field
The invention relates to the technical field of human body action prediction, in particular to a motion prediction system based on artificial intelligence.
Background
Human body movement is complex, variable, and flexible, and predicting human actions is currently a popular research direction: by predicting the actions a human body may perform, corresponding protective measures can be formulated and taken, which has very important application value.
Existing motion prediction schemes often suffer from missing semantic information and a lack of guidance information, which causes the generated pictures to violate real-world logic, appear blurry, lack continuity, and so on.
For the above reasons, the present inventors made intensive studies of existing human motion prediction systems, aiming to design an artificial-intelligence-based motion prediction system capable of solving the above problems.
Disclosure of Invention
To overcome these problems, the inventors designed an artificial-intelligence-based motion prediction system that extracts skeleton information from a video through a skeleton extraction module, predicts the motion state of the skeleton using a skeleton prediction module comprising a recurrent neural network, and finally obtains a predicted picture by combining the predicted skeleton map with the original picture. A reasonable motion prediction result can thus be obtained quickly and accurately, giving the invention high engineering application value.
Specifically, the invention aims to provide an artificial intelligence-based motion prediction system, which comprises a skeleton extraction module 1 and a skeleton prediction module 2;
the skeleton extraction module 1 is used for calling/receiving continuous original pictures, and extracting skeleton points of a human body on each frame of picture from the continuous original pictures to obtain a skeleton picture;
the skeleton prediction module 2 is configured to receive the skeleton map obtained by the skeleton extraction module 1, and predict a predicted skeleton map at a subsequent time according to the received skeleton map.
Wherein the consecutive original pictures refer to a picture set obtained by arranging each frame of a video to be processed in chronological order; preferably, the video contains human body actions; and/or
The frame number of the skeleton image extracted by the skeleton extraction module is consistent with the frame number of the original pictures forming the video, namely, each frame of original picture corresponds to one frame of skeleton image.
The skeleton extraction module extracts the skeleton points of the human body in the original pictures through the AlphaPose system to obtain the skeleton maps.
The skeleton prediction module 2 is configured to predict, from the received consecutive N frames of skeleton maps, the consecutive M frames of predicted skeleton maps after time T;
and T represents the time point of shooting the picture corresponding to the last frame of skeleton image in the continuous N frames of skeleton images.
Wherein the ratio of N to M may be 3;
preferably, the value of N is 100-200;
the value of M is 50-100.
The skeleton prediction module 2 comprises a recurrent neural network architecture composed of gated recurrent units; through sample training it learns the law of skeleton motion and completes the prediction.
Wherein the artificial intelligence based motion prediction system further comprises an image generation module 3,
the image generation module 3 is used for generating a prediction image according to the original image and the prediction skeleton image obtained by the skeleton prediction module 2;
preferably, the generated prediction pictures are continuous and correspond to the prediction skeleton map obtained by the skeleton prediction module 2 one by one.
Wherein the image generation module 3 comprises a rough contour generation sub-module 31 and a detail compensation sub-module 32;
the rough contour generation submodule 31 is configured to generate a corresponding rough human body contour map for each frame of the predicted skeleton map;
the detail compensation sub-module 32 is configured to fill details in the rough human body contour map to obtain a predicted picture corresponding to the predicted skeleton map.
The invention also provides a motion prediction method based on artificial intelligence, which comprises the following steps:
extracting skeleton points of the human body on each frame of picture from continuous original pictures to obtain a skeleton picture,
predicting a predicted skeleton map at a subsequent moment according to the skeleton map,
and generating a prediction picture according to the original picture and the prediction skeleton picture.
During generation of the predicted picture, a pose mask is first produced from the predicted skeleton map, and the predicted picture is then generated.
The invention has the advantages that:
(1) The artificial-intelligence-based motion prediction system provided by the invention extracts skeleton information from the video, predicts the motion state of the skeleton using a recurrent neural network, and finally generates images from the predicted skeleton maps and the original pictures, obtaining clear predicted pictures and video.
(2) The artificial-intelligence-based motion prediction system provided by the invention can extract skeleton points and predict skeleton maps in real time, meeting the requirements of real-time prediction and processing tasks;
(3) The artificial intelligence-based motion prediction system provided by the invention can also provide visual prediction pictures, which is convenient for feedback verification and can also increase the operability of the system.
Drawings
FIG. 1 illustrates an overall logic block diagram of an artificial intelligence based motion prediction system provided in accordance with the present invention;
FIG. 2 is a schematic diagram of the network model structure of the recurrent network in the artificial-intelligence-based motion prediction system provided in accordance with the present invention;
FIG. 3 is a schematic diagram illustrating a network structure of a skeleton prediction module in the whole artificial intelligence-based motion prediction system provided by the invention;
FIG. 4 shows an original picture in an experimental example;
FIG. 5 shows a skeleton diagram in an experimental example;
FIG. 6 shows a predicted skeleton map in an experimental example;
FIG. 7 shows a pose mask layout in an experimental example;
fig. 8 shows a prediction picture in an experimental example.
The reference numbers illustrate:
1 - skeleton extraction module
2 - skeleton prediction module
3 - image generation module
31 - rough contour generation submodule
32 - detail compensation submodule
Detailed Description
The invention is explained in more detail below with reference to the figures and examples. The features and advantages of the present invention will become more apparent from the description.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
According to the invention, the artificial intelligence-based motion prediction system comprises: a skeleton extraction module 1 and a skeleton prediction module 2.
In a preferred embodiment, the skeleton extraction module is configured to retrieve/receive consecutive original pictures, and extract skeleton points of a human body on each frame of picture from the consecutive original pictures, respectively, to obtain a skeleton map.
The continuous original pictures refer to: and arranging each frame of picture in the video to be processed according to a time sequence to obtain the picture set, preferably, the video is a video containing human body actions.
The frame number of the skeleton image extracted by the skeleton extraction module is consistent with the frame number of the original pictures forming the video, namely each frame of the original pictures corresponds to one frame of the skeleton image.
Because the original pictures forming the video are arranged in time sequence and are continuous, the obtained skeleton map is also arranged in time sequence and is continuous.
Preferably, the skeleton extraction module extracts the skeleton points of the human body in the original pictures through the open-source AlphaPose system to obtain the skeleton maps.
The AlphaPose system adopts a top-down detection method: all possible human bounding boxes in an image are detected first, and the human pose is then detected within each region. To address redundant regression boxes in the detection process, a Regional Multi-Person Pose Estimation (RMPE) framework is used. In this framework, a Symmetric Spatial Transformer Network (SSTN) is added on top of the Stacked Hourglass single-person pose estimator (SPPE), so that the region containing a single human body can be extracted even from an insufficiently accurate regression box. Parametric Pose Non-Maximum Suppression (NMS) with a novel pose distance metric effectively solves the problem of redundant poses in the detection process. Finally, a Pose-Guided Proposals Generator augments the training data: it relearns the different pose information in the output results to simulate the generation of human detection regression boxes, thereby producing a larger training set.
To handle skeleton extraction in dense crowds, the AlphaPose system adopts a globally optimal matching algorithm, which reduces the dependence of the skeleton detection model on human regression boxes and improves the model's robustness in complex scenes. The model adopts a novel joint-point loss function and outputs a series of candidate skeleton key points for all detected human regression boxes. By making the model output a multi-peak heat map, plausible joint positions can still be produced when the regression boxes are inaccurate; redundant results are then eliminated by clustering. A sparse graph model representing the connection relations and probabilities between human instances and joint points is constructed, and a globally optimal skeleton detection scheme is obtained by solving the optimal matching problem on this graph, overcoming the lack of a global view in two-step methods.
In addition, the AlphaPose system uses a method combining pixel rearrangement (pixel shuffle) and convolution for upsampling when outputting the heat map of skeleton key points; compared with traditional deconvolution and bilinear interpolation, this method has lower computational cost and avoids checkerboard artifacts.
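The pixel-rearrangement upsampling mentioned above (often called pixel shuffle or depth-to-space) can be sketched in a few lines of numpy; this is a generic illustration of the operation, not AlphaPose's implementation.

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Depth-to-space upsampling: (C*r^2, H, W) -> (C, H*r, W*r).

    Groups of r*r channels are interleaved into an r-times-larger
    spatial grid; combined with a preceding convolution this replaces
    deconvolution without its checkerboard artifacts.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

heat = np.random.rand(4, 8, 8)       # 4 = 1 channel * 2^2
up = pixel_shuffle(heat, 2)          # -> (1, 16, 16)
```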
The skeleton extraction module works in real time at 20 fps, i.e., it can extract at most 20 skeleton maps per second.
The number of skeleton maps required by the motion prediction system can be determined from the video duration and frame count: 100 frames from a 5 s video can be used to predict the next 40 frames (2 s), while 100 frames sampled at equal intervals over 100 s can predict the next 40 frames (40 s); the sampling frequency is not fixed.
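The variable sampling frequency described above amounts to picking a fixed number of frame indices at equal intervals from the source video; a minimal sketch (the helper name is our own):

```python
import numpy as np

def sample_frame_indices(total_frames: int, n: int) -> np.ndarray:
    """Pick n frame indices at equal intervals from a longer video.

    The same 100-frame input can come from a 5 s clip (every frame)
    or a 100 s clip (sparse, equal-interval sampling).
    """
    return np.linspace(0, total_frames - 1, n).round().astype(int)

dense = sample_frame_indices(100, 100)    # 5 s at 20 fps: every frame
sparse = sample_frame_indices(2000, 100)  # 100 s at 20 fps: every ~20th frame
```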
In a preferred embodiment, the skeleton prediction module 2 is configured to receive the skeleton map obtained by the skeleton extraction module 1, and predict a predicted skeleton map at a subsequent time according to the received skeleton map.
Specifically, the skeleton prediction module 2 is configured to predict, according to the received consecutive N-frame skeleton maps, consecutive M-frame predicted skeleton maps after T time;
and T represents the time point of shooting the picture corresponding to the last frame of skeleton map in the continuous N frames of skeleton maps.
The video formed by the original pictures corresponding to the M frames of predicted skeleton images is continuous with the video formed by the pictures corresponding to the N frames of skeleton images.
The ratio of N to M may be 3. The values of M and N have no strict upper limit; however, if M is too large the results may degrade, so to ensure prediction accuracy M generally does not exceed 100. Similarly, to ensure a sufficient amount of input data and hence prediction accuracy, N is generally greater than 50.
Preferably, the value of N is 100-200;
the value of M is 40-100.
Preferably, the skeleton prediction module 2 comprises a recurrent neural network architecture composed of gated recurrent units; through sample training, the module learns the law of skeleton motion and completes the prediction.
Preferably, during the training of the skeletal prediction module 2:
step 1, establish a neural network model composed mainly of convolutional and pooling layers, where the parameters of each layer are randomly initialized;
step 2, feed the labeled data into the model and update the parameters of each layer by gradient backpropagation according to the designed loss function.
Finally, the iteration of step 2 is repeated until the model meets the requirements;
preferably, the Adam optimizer is used, with 2000 and 10000 iterations respectively, a learning rate of 0.001, and 32 samples per training batch.
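As an illustration of the optimizer named in the recipe above, here is a minimal numpy sketch of a single Adam update using the cited learning rate of 0.001; real training would use a framework's built-in implementation, and the toy objective below is our own.

```python
import numpy as np

def adam_step(p, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam parameter update (learning rate 0.001 as cited above)."""
    m = b1 * m + (1 - b1) * grad               # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2          # second-moment estimate
    m_hat = m / (1 - b1 ** t)                  # bias corrections
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

# Toy run: minimize f(p) = p^2 starting from p = 1.0.
p, m, v = 1.0, 0.0, 0.0
for t in range(1, 5001):
    p, m, v = adam_step(p, 2.0 * p, m, v, t)   # gradient of p^2 is 2p
```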
In a more preferred embodiment, when training the network model of the recurrent structure, the prediction result is fed back as the input of the decoder to generate the prediction sequence. This avoids repeated parameter adjustment and allows the recurrent neural network to recover quickly from local errors, so that local optima are avoided and the short-term prediction error is reduced. Regarding the network structure shown in fig. 2, the inventors found that a relatively simple recurrent structure is sufficient for the skeleton prediction task, so a gated recurrent unit (GRU) module with low computational overhead is selected as the main prediction component in the present application.
The gated recurrent unit simplifies the forget, input, and output gates of a long short-term memory (LSTM) network into an update gate and a reset gate. The update gate adjusts how much state information from the previous moment is carried into the new state: the larger its value, the more previous state information is brought in. The reset gate adjusts how much state information from the previous moment is ignored: the larger its value, the less information is ignored.
The schematic diagram of the recurrent network structure is shown in fig. 3, and its calculation formulas are as follows:
z_t = σ(W_z · [h_{t-1}, x_t])
r_t = σ(W_r · [h_{t-1}, x_t])
h̃_t = tanh(W · [r_t * h_{t-1}, x_t])
h_t = (1 − z_t) * h_{t-1} + z_t * h̃_t
where z_t denotes the update gate, which expresses the degree to which the hidden state of the previous moment and the current input are carried into the current state, i.e., it decides which information to discard and which new information to add; r_t denotes the reset gate, which controls the degree to which the hidden information of the previous moment is forgotten; σ denotes the activation function; W_z denotes the weight parameter of the update gate; h_{t-1} denotes the hidden-layer output at the previous moment; x_t denotes the input; W_r denotes the weight parameter of the reset gate; W denotes the weight parameter used to infer the candidate state h̃_t; h̃_t denotes the intermediate (candidate) state capturing the current state information; and h_t denotes the current state.
In a preferred embodiment, the artificial-intelligence-based motion prediction system further comprises an image generation module 3; the image generation module 3 is used for generating a predicted picture according to the original picture and the predicted skeleton map obtained by the skeleton prediction module 2.
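The GRU formulas given earlier can be implemented directly; the following numpy sketch runs one GRU step per input frame (biases are omitted to match the formulas, and the weight shapes are our own choice for illustration).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Wr, W):
    """One gated-recurrent-unit step.

    Computes the update gate z_t, reset gate r_t, candidate state,
    and the blended new hidden state h_t.
    """
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(Wz @ hx)                                      # update gate z_t
    r = sigmoid(Wr @ hx)                                      # reset gate r_t
    h_tilde = np.tanh(W @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                     # h_t

rng = np.random.default_rng(0)
H, X = 8, 4                                   # hidden/input sizes (illustrative)
Wz, Wr, W = (rng.normal(size=(H, H + X)) for _ in range(3))
h = np.zeros(H)
for x in rng.normal(size=(5, X)):             # run five time steps
    h = gru_step(x, h, Wz, Wr, W)
```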
Preferably, the generated prediction pictures are continuous and correspond to the prediction skeleton map obtained by the skeleton prediction module 2 in a one-to-one manner.
Specifically, the image generation module 3 generates M predicted pictures from the N original pictures corresponding to the N skeleton maps and from the M consecutive predicted skeleton maps after time T obtained from those N skeleton maps; each predicted picture corresponds to one predicted skeleton map.
Preferably, the image generation module 3 comprises a rough contour generation sub-module 31 and a detail compensation sub-module 32.
The rough contour generation submodule 31 is configured to generate a corresponding rough human body contour map for each frame of the predicted skeleton map or pose mask. The rough contour map focuses on capturing the structural information of the human body in the video: it has the basic human contour, but details (such as clothing texture) are not clear enough and the background is blurred.
The rough contour generation submodule 31 completes generation of a rough human body contour map specifically through the following substeps:
substep 1, integrating convolution kernel information of an original picture and a predicted skeleton picture into an appearance information layer by stacking convolution layers;
substep 2, integrating the information obtained by the convolutional layer in substep 1 through a full connection layer and exchanging the information;
and 3, forming a decoder by a group of stacked and symmetrical convolution layers to generate a corresponding rough human body outline map.
Preferably, the original picture used in substep 1 may be any of the original pictures corresponding to the predicted skeleton maps, more preferably the last one. The predicted skeleton map used in substep 1 may be any one frame, but substeps 1, 2, and 3 must be performed separately for each frame. The convolution kernels are 3 × 3, so each convolution in substep 1 integrates the information of 9 pixels.
Preferably, a basic residual-network module is provided in the rough contour generation submodule 31 to improve generation performance: skip connections bypass parts of the encoder and decoder, directly connecting part of the input to the output so as to propagate internal image information.
The encoder of the rough contour generation submodule 31 consists of five residual blocks and a fully connected layer; each residual block consists of two convolutional layers with stride 1 and a subsampling convolutional layer with stride 2, and all convolutional layers use 3 × 3 kernels.
The encoder of the detail compensation submodule 32 consists of three residual blocks; each residual block consists of two convolutional layers with stride 1 and a subsampling convolutional layer with stride 2, and all convolutional layers use 3 × 3 kernels.
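The encoder layout described above determines the feature-map size at each stage: each residual block halves the spatial resolution via its stride-2 subsampling convolution. A small sketch, assuming padding 1 (not stated in the text, but required for the stride-1 3 × 3 convolutions to preserve size under the residual connections):

```python
def conv_out(n: int, kernel: int = 3, stride: int = 1, pad: int = 1) -> int:
    """Spatial size after one convolution (floor division)."""
    return (n + 2 * pad - kernel) // stride + 1

def encoder_sizes(n: int, num_blocks: int) -> list:
    """Feature-map sizes through residual blocks as described above:
    two stride-1 3x3 convs plus one stride-2 3x3 subsampling conv
    per block."""
    sizes = [n]
    for _ in range(num_blocks):
        n = conv_out(n)            # stride-1 conv keeps the size
        n = conv_out(n)            # second stride-1 conv
        n = conv_out(n, stride=2)  # stride-2 subsampling halves it
        sizes.append(n)
    return sizes

# Rough-contour encoder: five residual blocks on a 256x256 input.
print(encoder_sizes(256, 5))  # prints [256, 128, 64, 32, 16, 8]
```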
The rough contour generation submodule 31 takes the original picture and the target skeleton as input and generates a rough human body contour map; the detail compensation submodule 32 takes the original picture plus the rough human body contour map as input and refines the details to generate an appearance difference map. Finally, the blurred result map is combined with the appearance difference map to generate the final result map.
Compared with the network structure of the rough contour generation submodule 31, the network of the detail compensation submodule 32 removes the fully connected layer that compresses the input information, which helps preserve more detail in the input; the generator of the detail compensation submodule 32 takes the original picture as input. Using the appearance difference map also accelerates the convergence of model training, because the detail compensation submodule 32 builds on the reasonable result generated by the rough contour generation submodule 31 and focuses on learning the missing appearance details rather than synthesizing the whole target image. In this way the detail compensation submodule 32 supplements and refines the rough human body contour map into the predicted picture corresponding to the predicted skeleton map. The appearance difference map contains no formed human body or background information; it only supplements and corrects textures and the like.
In a preferred embodiment, the training process of the rough contour generation sub-module 31 and the detail compensation sub-module 32 is substantially the same as the training process of the skeleton prediction module 2, and corresponding samples are given for different modules.
In a preferred embodiment, when generating a rough human body contour map for each frame of the predicted skeleton map, the rough contour generation submodule 31 connects the 18 skeleton key points obtained by skeleton detection according to ergonomic principles and fills in the corresponding body-trunk parts; connected into one piece, they form a pose mask. Adding the pose mask gives the foreground portion containing the human body more weight than the background. The inventors found that the human texture generated after adding the pose mask is more detailed and clearer, with more accurate details. The pose mask is set to 1 for the foreground and 0 for the background; by connecting body parts and applying morphological operations, it can cover substantially the whole body in the target image.
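A simplified numpy sketch of the foreground-1 / background-0 pose mask with a morphological dilation; the real module additionally connects the 18 key points into limbs according to ergonomic principles before dilating (the function names and iteration count here are illustrative assumptions).

```python
import numpy as np

def dilate(mask: np.ndarray, iters: int = 1) -> np.ndarray:
    """Binary dilation with a 3x3 structuring element, via numpy shifts."""
    for _ in range(iters):
        padded = np.pad(mask, 1)
        out = np.zeros_like(mask)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                out |= padded[1 + dy : 1 + dy + mask.shape[0],
                              1 + dx : 1 + dx + mask.shape[1]]
        mask = out
    return mask

def pose_mask(keypoints, shape, iters=3):
    """Foreground-1 / background-0 mask from skeleton key points.

    Marks each key point, then dilates so the mask covers the
    surrounding body region.
    """
    mask = np.zeros(shape, dtype=np.uint8)
    for y, x in keypoints:
        mask[y, x] = 1
    return dilate(mask, iters)

m = pose_mask([(10, 10), (12, 10)], (32, 32))
```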
The invention also provides an artificial intelligence based motion prediction method, which is realized by the artificial intelligence based motion prediction system, and comprises the following steps:
extracting the skeleton points of the human body on each frame of picture from the continuous original pictures through a skeleton extraction module 1 to obtain a skeleton picture,
the predicted skeleton map at the subsequent moment is predicted according to the skeleton map by the skeleton prediction module 2,
and then a prediction picture is generated according to the original picture and the prediction skeleton picture through the image generation module 3.
In the process of generating the predicted picture, a pose mask is first generated from the predicted skeleton map, a rough human body contour map is then generated according to the pose mask, and finally details are filled into the rough contour map to obtain the predicted picture.
Preferably, the predicted skeleton map obtained by the skeleton prediction module 2 comprises 18 skeleton key points. Before the predicted skeleton map is used to generate a predicted picture, it is processed to obtain a pose mask, and the pose mask is used to generate the predicted picture: the 18 skeleton key points in the predicted skeleton map are connected according to ergonomic principles, the corresponding body-trunk parts are filled in, and the connected result forms the pose mask.
Experimental example:
extracting skeleton points of a human body on each frame of picture from continuous original pictures through a skeleton extraction module to obtain a skeleton picture;
the skeleton extraction module extracts skeleton points of a human body in an original picture through an alpha phase system, the number of the continuous original pictures is 100, the last picture in the continuous original pictures is shown in fig. 4, 16 different human body images are shown in the picture, and each picture contains 16 human body images. Extracting 100 frames of skeleton images from the 100 frames of original pictures, wherein the skeleton image corresponding to the last frame of original picture is shown in fig. 5, and 16 corresponding human skeleton images are shown in fig. 5;
predicting continuous 40 frames of predicted skeleton maps at subsequent time according to the 100 frames of skeleton maps by a skeleton prediction module;
the framework prediction module learns the framework motion rule through sample training and completes prediction; the last frame picture in the prediction skeleton map is shown in fig. 6, and 16 corresponding prediction skeleton maps are shown in fig. 6;
processing the 40 frames of predicted skeleton maps to obtain 40 frames of pose masks;
the 18 skeleton key points in each frame of the predicted skeleton map are connected according to ergonomic principles, the corresponding body parts are filled in, and the connected result forms the pose mask; the last frame of the pose mask is shown in FIG. 7;
generating predicted pictures according to the original pictures and the pose masks:
40 frames of rough human body contour maps are generated from the pose masks, and details are filled into them to obtain the predicted pictures; the last-frame predicted picture corresponding to FIGS. 6 and 7 is shown in fig. 8.
This experimental example shows that the artificial-intelligence-based motion prediction method provided by the invention can predict the human motion trend at subsequent moments and obtain clear image results.
The present invention has been described above in connection with preferred embodiments, which are merely exemplary and illustrative. On the basis of the above, the invention can be subjected to various substitutions and modifications, and the substitutions and the modifications are all within the protection scope of the invention.

Claims (5)

1. An artificial intelligence-based motion prediction system, characterized by comprising a skeleton extraction module (1) and a skeleton prediction module (2);
the skeleton extraction module (1) is used for calling/receiving continuous original pictures and then extracting the skeleton points of the human body from each frame of picture to obtain a skeleton map;
the skeleton prediction module (2) is used for receiving the skeleton maps obtained by the skeleton extraction module (1) and predicting skeleton maps at subsequent moments from the received skeleton maps;
the skeleton prediction module (2) is used for predicting M continuous frames of predicted skeleton maps after time T from the received N continuous frames of skeleton maps;
T denotes the time point at which the picture corresponding to the last frame of the N continuous frames of skeleton maps was taken;
the skeleton prediction module (2) comprises a recurrent neural network architecture formed by gated recurrent units, which learns the law of skeleton motion through sample training and completes the prediction;
when the network model of the recurrent network structure is trained, the result obtained by prediction is used as the input of the decoder to generate the prediction sequence;
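The gated-recurrent-unit prediction described above, with the decoder fed its own previous output to roll out the sequence, can be sketched in plain NumPy. This is a minimal illustration under stated assumptions: the hidden size, the per-frame feature dimension (18 key points × 2 coordinates), and the random weights are all placeholders for what a trained model would learn; the patent does not disclose these details.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 36, 64  # per-frame features (18 keypoints x 2 coords) and hidden size (assumed)

# Randomly initialised GRU parameters; a real system would learn these.
Wz, Uz, bz = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
Wr, Ur, br = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
Wh, Uh, bh = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H)), np.zeros(H)
Wo, bo = rng.normal(0, 0.1, (D, H)), np.zeros(D)  # hidden state -> skeleton frame

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h + bz)              # update gate
    r = sigmoid(Wr @ x + Ur @ h + br)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h) + bh)  # candidate state
    return (1 - z) * h + z * h_tilde

def predict_skeletons(observed, M):
    """Encode N observed skeleton frames, then decode M future frames,
    feeding each prediction back in as the next decoder input."""
    h = np.zeros(H)
    for x in observed:            # encoder: consume the N observed frames
        h = gru_step(x, h)
    x, out = observed[-1], []
    for _ in range(M):            # decoder: autoregressive roll-out
        h = gru_step(x, h)
        x = Wo @ h + bo           # predicted skeleton frame
        out.append(x)
    return np.stack(out)

N, M = 10, 5
future = predict_skeletons(rng.normal(size=(N, D)), M)  # shape (M, D)
```

Feeding the prediction back as the decoder input, as claim 1 recites for training, exposes the model to its own errors and tends to make long roll-outs more stable than teacher forcing alone.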
the artificial intelligence based motion prediction system further comprises an image generation module (3);
the image generation module (3) is used for generating a prediction picture from the original picture and the predicted skeleton map obtained by the skeleton prediction module (2);
the image generation module (3) comprises a rough contour generation submodule (31) and a detail compensation submodule (32);
the rough contour generation submodule (31) is used for generating a corresponding rough human body contour map for each frame of the predicted skeleton map; the encoder of the rough contour generation submodule (31) consists of five residual blocks and a fully connected layer, each residual block consists of two convolution layers with a stride of 1 and one downsampling convolution layer with a stride of 2, and all the convolution layers use 3×3 convolution kernels;
the detail compensation submodule (32) is used for filling in details on the rough human body contour map to obtain the prediction picture corresponding to the predicted skeleton map;
the encoder of the detail compensation submodule (32) consists of three residual blocks, each residual block consists of two convolution layers with a stride of 1 and one downsampling convolution layer with a stride of 2, and all the convolution layers use 3×3 convolution kernels.
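A consequence of the encoder structure recited in claim 1 is that each residual block, ending in a stride-2 convolution, halves the spatial resolution. The short sketch below traces the feature-map side length through the five-block rough-contour encoder and the three-block detail encoder; the 128-pixel input size and "same" padding are assumptions, since the patent does not state them.

```python
def downsampled_sizes(size, num_blocks):
    """Side length of the feature map after each residual block.

    Each block ends with a stride-2 convolution, so the resolution
    halves per block (ceil division, assuming 'same' padding).
    """
    sizes = []
    for _ in range(num_blocks):
        size = (size + 1) // 2
        sizes.append(size)
    return sizes

# Rough-contour encoder: five residual blocks; detail encoder: three.
rough = downsampled_sizes(128, 5)   # [64, 32, 16, 8, 4]
detail = downsampled_sizes(128, 3)  # [64, 32, 16]
```

The deeper five-block encoder thus compresses a 128×128 input down to a 4×4 grid before the fully connected layer, while the shallower detail encoder stops at 16×16, keeping enough resolution to re-inject fine texture.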
2. The artificial intelligence based motion prediction system of claim 1,
the continuous original pictures refer to: a picture set obtained by arranging each frame of a video to be processed in time order, wherein the video is a video containing human motion;
the number of skeleton map frames extracted by the skeleton extraction module is the same as the number of original picture frames forming the video, i.e. each frame of original picture corresponds to one frame of skeleton map.
3. The artificial intelligence based motion prediction system of claim 1,
the skeleton extraction module (1) extracts the skeleton points of the human body in the original picture through the AlphaPose system to obtain the skeleton map.
4. The artificial intelligence based motion prediction system of claim 1,
the ratio of N to M is 3;
the value of M is 50-100.
5. The artificial intelligence based motion prediction system of claim 1,
the generated prediction pictures are continuous and correspond one-to-one to the predicted skeleton maps obtained by the skeleton prediction module (2).
CN202010286652.1A 2020-04-13 2020-04-13 Motion prediction system based on artificial intelligence Active CN111539981B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010286652.1A CN111539981B (en) 2020-04-13 2020-04-13 Motion prediction system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN111539981A CN111539981A (en) 2020-08-14
CN111539981B true CN111539981B (en) 2023-03-10

Family

ID=71977368

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010286652.1A Active CN111539981B (en) 2020-04-13 2020-04-13 Motion prediction system based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN111539981B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102682452A (en) * 2012-04-12 2012-09-19 西安电子科技大学 Human movement tracking method based on combination of production and discriminant
CN106815855A (en) * 2015-12-02 2017-06-09 山东科技职业学院 Based on the human body motion tracking method that production and discriminate combine
CN109460702A (en) * 2018-09-14 2019-03-12 华南理工大学 Passenger's abnormal behaviour recognition methods based on human skeleton sequence
CN110363131A (en) * 2019-07-08 2019-10-22 上海交通大学 Anomaly detection method, system and medium based on human skeleton

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10628664B2 (en) * 2016-06-04 2020-04-21 KinTrans, Inc. Automatic body movement recognition and association system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on a method for predicting lower-limb motion trajectories during the support phase of the long jump; Xiang Yunping; Science Technology and Engineering; 2017-01-28; Vol. 17, No. 03; Sections 1-2 of the text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant