CN117392588A - Motion gesture prediction method, device, equipment and computer readable storage medium

Info

Publication number
CN117392588A
CN117392588A
Authority
CN
China
Prior art keywords
motion
features
modal
target individual
prediction
Prior art date
Legal status
Pending
Application number
CN202311507146.0A
Other languages
Chinese (zh)
Inventor
邓博文
王晓茹
曲昭伟
刘明时
马晨阳
余龙龙
卞德昕
李梅芳
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202311507146.0A
Publication of CN117392588A
Status: Pending

Classifications

    • G06V20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06F18/253: Fusion techniques of extracted features
    • G06V10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The application discloses a motion gesture prediction method, apparatus, device and computer readable storage medium, applied to the technical field of data processing. A to-be-processed motion video of a target individual whose motion gesture is to be predicted is preprocessed to obtain to-be-processed motion video data, and a multi-modal feature extraction and fusion network performs feature extraction and fusion on these data to obtain a multi-modal feature, namely a fusion of the target individual's key point sequence features, motion speed difference features and motion light and shadow features. The multi-modal feature is then input into a motion gesture prediction network to obtain the motion prediction result output by that network. The multi-modal feature carries little noise information and accurately reflects the characteristics of the target individual in motion, even in complex application scenarios and for medium- and long-term prediction, so performing motion gesture prediction based on the multi-modal feature improves the prediction accuracy of the target individual's motion gesture.

Description

Motion gesture prediction method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to a motion gesture prediction method, apparatus, device, and computer readable storage medium.
Background
The gesture of a target individual plays an important role in its interaction with the outside world. Accurately predicting the motion gesture of a target individual therefore has very broad application prospects in fields such as autonomous driving, interpersonal interaction, computer animation and robotics.
At present, deep learning models such as recurrent neural networks, convolutional neural networks, generative adversarial networks and graph convolutional neural networks can be used to predict the motion gesture of a target individual. The input to these methods is either a raw motion video or a key point sequence of the target individual extracted from that video. However, motion video contains noise information irrelevant to motion prediction, and extracted key point sequences suffer from missing, duplicated and inaccurate data; the prediction results in complex application scenarios are therefore unsatisfactory, and the prediction accuracy of the target individual's motion gesture is low.
Disclosure of Invention
In view of this, the present application provides a motion gesture prediction method, apparatus, device, and computer-readable storage medium, which can improve the accuracy of motion gesture prediction of a target individual.
To solve the above problems, the technical solution provided by the present application is as follows:
in a first aspect, the present application provides a method for predicting a motion gesture, the method comprising:
in response to acquiring a motion video to be processed, preprocessing the motion video to be processed to obtain motion video data to be processed, wherein the motion video to be processed comprises a target individual whose motion gesture is to be predicted;
performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features;
and inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
In one possible implementation, the performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature includes:
extracting the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features from the motion video data to be processed by utilizing the pre-constructed multi-modal feature extraction and fusion network;
and fusing the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features to obtain the multi-modal feature.
In one possible implementation, after obtaining the multi-modal feature, the method further includes:
inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features;
and inputting the overall key point multi-modal features to a decoder layer to obtain local key point multi-modal features.
In one possible implementation, the inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network includes:
inputting the local key point multi-modal features into the pre-constructed motion gesture prediction network to obtain the motion prediction result output by the pre-constructed motion gesture prediction network.
In a second aspect, the present application provides a motion gesture prediction apparatus, the apparatus comprising:
the preprocessing module is used for preprocessing the motion video to be processed in response to acquiring the motion video to be processed, to obtain motion video data to be processed, wherein the motion video to be processed comprises a target individual whose motion gesture is to be predicted;
the feature extraction and fusion module is used for performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features;
and the prediction module is used for inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
In one possible implementation manner, the feature extraction and fusion module includes a feature extraction sub-module and a feature fusion sub-module:
the feature extraction submodule is used for extracting the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features from the motion video data to be processed by utilizing the pre-constructed multi-modal feature extraction and fusion network;
the feature fusion submodule is used for fusing the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features to obtain the multi-modal feature.
In one possible implementation, after obtaining the multi-modal feature, the apparatus further includes an encoding submodule and a decoding submodule:
the encoding submodule is used for inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features;
the decoding submodule is used for inputting the overall key point multi-modal features to a decoder layer to obtain the local key point multi-modal features.
In one possible implementation, the prediction module is specifically configured to: input the local key point multi-modal features into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
In a third aspect, the present application provides a motion gesture prediction apparatus, the apparatus comprising: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the motion gesture prediction method described in the first aspect above.
In a fourth aspect, the present application provides a computer readable storage medium storing instructions that, when executed on a device, cause the device to perform the motion gesture prediction method described in the first aspect above.
Accordingly, the present application has the following beneficial effects:
the application provides a motion gesture prediction method, which comprises the steps of firstly, acquiring motion video data to be processed, and then preprocessing the motion video to be processed to obtain the motion video data to be processed, wherein the motion video to be processed comprises target individuals needing motion gesture prediction; secondly, carrying out feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain multi-modal features, wherein the multi-modal features are key point sequence features of the target individual, motion speed difference features of the target individual and fusion features of motion light and shadow features; and finally, inputting the multi-modal characteristics into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network. Thus, the motion video data is processed in advance to extract the key point sequence feature, the motion speed difference feature and the motion light shadow feature of the target individual from the motion video data, and the three features are fused to obtain the multi-mode feature. The multi-modal feature has less noise information, and can accurately reflect the characteristics of the target individual in motion during complex application scenes and medium-long term prediction, so that the motion gesture prediction is performed based on the multi-modal feature, and the prediction accuracy of the motion gesture of the target individual can be improved.
The embodiment of the application also provides a device corresponding to the method, and the device has the same beneficial effects as the method.
Drawings
FIG. 1 is a schematic diagram of single-modality input data provided in an embodiment of the present application;
fig. 2 is a flow chart of a motion gesture prediction method provided in an embodiment of the present application;
fig. 3 is an application schematic diagram of a motion gesture prediction method provided in an embodiment of the present application;
fig. 4 is a schematic working diagram of a multi-modal feature extraction and fusion network according to an embodiment of the present application;
FIG. 5 is a multi-scale keypoint schematic of a target individual provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of a spatio-temporal attention mechanism implementation provided in an embodiment of the present application;
fig. 7 is a schematic structural diagram of a motion gesture prediction apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a motion gesture prediction device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The gesture of a target individual plays an important role in its interaction with the outside world. Taking a human as an example: when a person is walking or driving a vehicle, the person's intention over the next few seconds can be predicted from their actions, so that collisions are avoided; and in conversation, a person's body language forms an important part of non-verbal communication. Therefore, accurately predicting human motion gestures has very broad application prospects in fields such as autonomous driving, interpersonal interaction, computer animation and robotics.
At present, deep learning models such as recurrent neural networks, convolutional neural networks, generative adversarial networks and graph convolutional neural networks can be used to predict the motion gesture of a target individual. However, these methods achieve good short-term prediction only when the human body is complete and unoccluded. Their input is either a raw motion video or a key point sequence of the target individual extracted from the motion video; referring to fig. 1, fig. 1 is a schematic diagram of single-modality input data provided in an embodiment of the present application, with motion video input data on the left and key point sequence input data on the right. The motion video contains all the information of the human body gesture, but also noise information irrelevant to motion prediction. The key point sequence is obtained by extracting the key points of the human body from the motion video; compared with the motion video it contains only human motion information and less noise, but the data are prone to being missing, duplicated or inaccurate. These methods are therefore unsatisfactory in complex scenarios such as partial occlusion of the human body and long-term prediction, and the prediction accuracy of the target individual's motion gesture is low in complex application scenarios and for medium- and long-term prediction.
Based on this, the embodiments of the present application provide a motion gesture prediction method, apparatus, device, and computer readable storage medium. First, after the motion video to be processed is acquired, it is preprocessed to obtain motion video data to be processed, wherein the motion video to be processed includes a target individual whose motion gesture is to be predicted. Second, feature extraction and fusion are performed on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features. Finally, the multi-modal feature is input into a pre-constructed motion gesture prediction network to obtain the motion prediction result output by that network. In this way, the motion video data are processed in advance to extract the key point sequence features, the motion speed difference features and the motion light and shadow features of the target individual, and the three features are fused to obtain the multi-modal feature. The multi-modal feature carries little noise information and accurately reflects the characteristics of the target individual in motion, even in complex application scenarios and for medium- and long-term prediction, so performing motion gesture prediction based on the multi-modal feature can improve the prediction accuracy of the target individual's motion gesture.
In order to facilitate understanding of the technical solutions provided by the embodiments of the present application, a method, an apparatus, a device, and a computer readable storage medium for predicting a motion gesture provided by the embodiments of the present application are described below with reference to the accompanying drawings.
The motion gesture prediction method provided by the embodiment of the application can be applied to a server or a system, and is not limited to this. Referring to fig. 2, fig. 2 is a flowchart of a motion gesture prediction method provided in an embodiment of the present application, where the method specifically includes S201-S203.
S201: in response to acquiring the motion video to be processed, preprocessing the motion video to be processed to obtain motion video data to be processed, wherein the motion video to be processed comprises target individuals needing to predict motion postures.
After the motion video to be processed, which contains the target individual whose motion gesture is to be predicted, is acquired, a series of processing steps is performed by a motion prediction model to predict the motion gesture of the target individual. The motion prediction model comprises at least a multi-modal feature extraction and fusion network, a key point multi-scale feature extraction network and a motion gesture prediction network, wherein the key point multi-scale feature extraction network is composed of an encoder layer and a decoder layer. Referring to fig. 3, fig. 3 is an application schematic diagram of a motion gesture prediction method provided in an embodiment of the present application. The motion video data to be processed are processed by each structure in the motion prediction model to finally obtain the motion prediction result.
The motion of the target individual is a dynamic process, and the motion video to be processed is essentially a sequence of images. The motion video of the target individual can be captured by a 3D camera, or retrieved from a database under a preset path, as actual requirements dictate. The motion video to be processed is preprocessed to extract the motion video data to be processed, in particular the data describing the gesture changes of the target individual. The embodiment of the application does not limit the specific implementation of the preprocessing, which can be selected according to actual requirements.
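To make the data flow concrete, the following is a minimal preprocessing sketch in Python. The patent leaves the preprocessing open, so the use of OpenCV, the frame size, and the normalization here are illustrative assumptions, not the claimed method:

```python
# Minimal preprocessing sketch (illustrative assumptions: OpenCV decoding,
# 224x224 frames, [0, 1] normalization). It turns a motion video into the
# "motion video data to be processed" consumed by the networks downstream.
import cv2
import numpy as np

def preprocess_video(path: str, size=(224, 224)) -> np.ndarray:
    """Decode a video file into a (T, H, W, 3) float32 array in [0, 1]."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:                                          # end of stream
            break
        frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)      # OpenCV decodes as BGR
        frames.append(cv2.resize(frame, size).astype(np.float32) / 255.0)
    cap.release()
    if not frames:
        raise ValueError(f"no frames decoded from {path}")
    return np.stack(frames, axis=0)
```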
S202: and carrying out feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain multi-modal features, wherein the multi-modal features are key point sequence features of the target individual, motion speed difference features of the target individual and fusion features of motion light and shadow features.
After the motion video to be processed is input into the motion prediction model, preprocessing is carried out to obtain the motion video data to be processed, and the multi-modal feature extraction and fusion network is used for carrying out feature extraction and fusion on the motion video data to be processed to obtain the multi-modal features.
Referring to fig. 4, fig. 4 is a schematic working diagram of a multi-mode feature extraction and fusion network according to an embodiment of the present application. In one possible implementation manner, the feature extraction and fusion of the motion video data to be processed by using a pre-built multi-modal feature extraction and fusion network to obtain multi-modal features includes: extracting key point sequence characteristics of the target individual, movement speed difference characteristics of the target individual and movement light and shadow characteristics in the to-be-processed movement video data by utilizing the pre-constructed multi-mode characteristic extraction and fusion network; and fusing the key point sequence characteristics of the target individual, the motion speed difference characteristics of the target individual and the motion light and shadow characteristics to obtain the multi-mode characteristics.
And taking the motion video data to be processed as input, carrying out feature extraction on the motion video data to be processed to obtain key point sequence features of a target individual, motion speed difference features and motion light shadow features of the target individual, projecting each feature into respective embedded space, and sending the three features into a continuous attention layer.
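A minimal sketch of this embedding-and-fusion step follows. The patent only specifies that each modality is projected into its own embedding space before the attention layers; the linear projections, the concatenation-based fusion, the embedding dimension, and all names (ModalityEmbedding, kp_proj, and so on) are illustrative assumptions:

```python
# Sketch of per-modality embedding followed by fusion (all dimensions and the
# concatenation-based fusion are assumptions; the patent fixes neither).
import torch
import torch.nn as nn

class ModalityEmbedding(nn.Module):
    def __init__(self, kp_dim: int, vel_dim: int, flow_dim: int, embed_dim: int = 128):
        super().__init__()
        self.kp_proj = nn.Linear(kp_dim, embed_dim)      # key point sequence features
        self.vel_proj = nn.Linear(vel_dim, embed_dim)    # motion speed difference features
        self.flow_proj = nn.Linear(flow_dim, embed_dim)  # motion light and shadow features

    def forward(self, kp, vel, flow):
        # Each input is (batch, time, modality_dim); after projection the three
        # modalities share one embedding size and are concatenated so that the
        # subsequent attention layers can mix information across modalities.
        return torch.cat(
            [self.kp_proj(kp), self.vel_proj(vel), self.flow_proj(flow)], dim=-1)
```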
Since the three input features all contain information in the time dimension, a novel spatio-temporal attention mechanism can be used in the attention layers. The temporal attention block updates the multi-modal fusion feature based on the input features preceding the current frame, and the spatial attention block attends to the features of all key points and image blocks of the current frame; the extracted features are then projected back into joint space, and the multi-modal feature is extracted using the residual between input and output. Specifically, the projection back into joint space is learned automatically through the model's internal parameters: with learnable parameters, the model can learn feature representations of different spaces during training and project the multi-modal feature into a suitable representation space. The residual connection is standard practice in deep learning: whereas prior motion prediction models predict the output directly from the input, a residual model predicts the difference between input and output, which reduces the difficulty of training the motion prediction model.
In one possible implementation, the key point sequence features of the target individual, which are the position features of a plurality of key points, can be extracted from the motion video data by a deep learning tool such as OpenPose. The motion speed difference features of the target individual can then be obtained by computing, on the basis of the key point sequence, the coordinate difference of the same key point between two adjacent frames. The motion light and shadow features can be extracted by deep learning optical flow estimation algorithms such as FlowNet, RAFT (Recurrent All-Pairs Field Transforms) and GMA (Global Motion Aggregation).
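Of the three modalities, the motion speed difference feature has the most direct definition, namely the coordinate difference of the same key point between two adjacent frames. A short sketch (the array layout is an assumption):

```python
import numpy as np

def speed_difference(keypoints: np.ndarray) -> np.ndarray:
    """keypoints: (T, J, 2) pixel coordinates of J key points over T frames
    (e.g. as produced by a key point extractor such as OpenPose).
    Returns the (T-1, J, 2) per-frame coordinate differences, i.e. the
    motion speed difference feature described above."""
    return keypoints[1:] - keypoints[:-1]
```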
Different modalities describe the motion state of the target individual from different perspectives; utilizing multi-modal information therefore describes the motion gesture of the target individual more comprehensively, so that it can be predicted more accurately.
When a traditional graph convolutional neural network extracts the key point sequence features of a target individual, the key points are usually treated as a fully connected graph, which captures the relation between a single key point and all other key points; however, unlike a convolutional neural network, which extracts image features at different levels through convolution, it cannot extract key point features at different levels of the target individual.
During motion gesture prediction, the positions of the local key points of the target individual change frequently, which makes them difficult to predict; by contrast, the overall structure of the target individual remains relatively stable during motion, so the prediction results for key points reflecting the overall structure are more accurate.
In order to use a graph convolutional network to extract multi-modal features of the target individual's key points at different levels, and to fuse the multi-modal features of the key points at different scales to jointly predict the key point positions during motion, the embodiment of the application provides a motion prediction model that can extract multi-modal features of the key points at different scales and use them to predict the motion gesture of the target individual. Referring to fig. 5, fig. 5 is a schematic diagram of multi-scale key points of a target individual according to an embodiment of the present application. The left side shows the finest-granularity scale of the key points, describing the change of each individual key point as the target individual moves.
However, when the target individual moves, the local key points change quite frequently and are difficult to predict. Through clustering, adjacent key points can be merged into one key point, yielding the human body key points at a coarser granularity. The granularity becomes coarser toward the right of fig. 5, simplifying the overall gesture of the target individual, and the overall gesture changes relatively stably during motion. Using this property, the stable overall gesture of the target individual can be predicted first, and the local gesture can then be predicted from the overall gesture, improving the stability and accuracy of the prediction result. A minimal sketch of such coarse-graining follows.
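The sketch below coarse-grains key points by averaging fixed groups of adjacent points. The patent obtains the merge through clustering, so both the group assignments and the averaging rule here are assumptions:

```python
import numpy as np

# Hypothetical grouping of 19 fine-scale key points into 6 coarse body parts.
# The patent derives such groups by clustering adjacent key points; these
# fixed indices are illustrative only.
COARSE_GROUPS = [
    [0, 1, 2],        # head
    [3, 4, 5, 6],     # torso
    [7, 8, 9],        # left arm
    [10, 11, 12],     # right arm
    [13, 14, 15],     # left leg
    [16, 17, 18],     # right leg
]

def pool_keypoints(kp: np.ndarray) -> np.ndarray:
    """kp: (T, 19, 2) fine-scale key points -> (T, 6, 2) coarse-scale points."""
    return np.stack([kp[:, g].mean(axis=1) for g in COARSE_GROUPS], axis=1)
```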
In one possible implementation, after obtaining the multi-modal feature, the method further includes: inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features; and inputting the overall key point multi-modal features to a decoder layer to obtain local key point multi-modal features.
The key point multi-scale feature extraction network adopts an encoder-decoder architecture; see fig. 3. The input of the encoder layer is the multi-modal feature produced by the multi-modal feature extraction and fusion network. In the encoder layer, downsampling graph convolution modules and downsampling modules are arranged in sequence: the downsampling graph convolution module captures the relations between key points, and the downsampling module downsamples the key points, thereby obtaining the overall (high-level) key point multi-modal features. After the encoder layer, key point multi-modal features at multiple scales from local to overall are obtained. The output of the encoder layer serves as the input of the decoder layer to restore the multi-modal features of the target individual from overall to local. In the decoder layer, upsampling graph convolution modules and upsampling modules are connected in sequence: the upsampling graph convolution module captures the relations between key points, and the upsampling module upsamples the overall (high-level) key point multi-modal features back to the local (low-level) scale. The local key point multi-modal features output by the decoder are input into the motion gesture prediction network to predict the motion of the target individual's key points. The numbers of downsampling graph convolution modules and downsampling modules in the encoder layer, and of upsampling graph convolution modules and upsampling modules in the decoder layer, can be selected according to actual requirements, which is not limited in the embodiment of the present application. A skeletal sketch of this architecture follows.
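The sketch below shows one encoder stage and one decoder stage. The graph convolution form (X' = relu(A X W)), the fixed pooling matrix used for down- and upsampling, and the residual combination are simplifying assumptions rather than the patent's exact modules:

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Minimal graph convolution X' = relu(A X W); A is assumed normalized."""
    def __init__(self, dim: int, adj: torch.Tensor):
        super().__init__()
        self.register_buffer("adj", adj)               # (J, J) key point adjacency
        self.lin = nn.Linear(dim, dim)

    def forward(self, x):                              # x: (batch, J, dim)
        return torch.relu(self.lin(self.adj @ x))

class KeypointEncoderDecoder(nn.Module):
    """One-stage encoder-decoder sketch: graph convolution, key point
    downsampling via a pooling matrix S (J_coarse x J_fine), graph
    convolution at the coarse scale, then upsampling with S^T."""
    def __init__(self, adj_fine, adj_coarse, pool, dim: int = 128):
        super().__init__()
        self.enc_gcn = GraphConv(dim, adj_fine)        # fine-scale relations
        self.register_buffer("pool", pool)             # (J_coarse, J_fine)
        self.coarse_gcn = GraphConv(dim, adj_coarse)   # overall-scale relations
        self.dec_gcn = GraphConv(dim, adj_fine)

    def forward(self, x):                              # x: (batch, J_fine, dim)
        local = self.enc_gcn(x)
        overall = self.coarse_gcn(self.pool @ local)       # overall key points
        restored = self.pool.transpose(0, 1) @ overall     # back to local scale
        return self.dec_gcn(restored + local)              # residual combination
```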
When a target individual performs an action, not all of its key points change; rather, the positions of certain key points associated with that action change. Thus, if the network can be made aware of the key frames at which an action starts and of the key points associated with that action, the stability and accuracy of motion prediction can be improved. For the key frames and key points in the motion video that the multi-modal feature extraction and fusion network tends to overlook, a spatio-temporal attention mechanism is introduced into the key point multi-scale feature extraction network to guide attention to their changes. Referring to fig. 6, fig. 6 is a schematic flow chart of a spatio-temporal attention mechanism implementation provided in an embodiment of the present application.
Spatial attention is extracted between the downsampling graph convolution modules; because several downsampling graph convolution modules are used and the number of key points differs in each, spatial attention must be extracted for each module separately. The output of each downsampling graph convolution module is first max-pooled and average-pooled, the results are merged, feature channel fusion is performed via a 1x1 convolution, and finally a Sigmoid activation function is applied. Temporal attention is extracted in the upsampling graph convolution modules; because the predicted time dimension is the same at every scale, the different upsampling graph convolution modules can share one temporal attention layer. The temporal attention model largely mirrors the spatial attention model, except that the final activation function is Softmax. After the temporal and spatial attention are extracted, the temporal attention matrix and the spatial attention matrix are cross multiplied to obtain the spatio-temporal attention. A sketch is given below.
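The sketch follows that description: channel-wise max pooling and average pooling, merging, 1x1 convolutional channel fusion, then Sigmoid for the spatial branch and Softmax over time for the temporal branch. Interpreting the final cross multiplication as an element-wise product of the two attention maps, and the (batch, channels, time, key points) tensor layout, are assumptions:

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Max-pool and average-pool across channels, merge, fuse with a 1x1
    convolution, activate with Sigmoid (per the description above)."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, x):                              # x: (B, C, T, J)
        mx, _ = x.max(dim=1, keepdim=True)             # (B, 1, T, J)
        avg = x.mean(dim=1, keepdim=True)              # (B, 1, T, J)
        return torch.sigmoid(self.fuse(torch.cat([mx, avg], dim=1)))

class TemporalAttention(nn.Module):
    """Same structure, but the final activation is Softmax over the time
    axis; one instance can be shared across the upsampling modules."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Conv2d(2, 1, kernel_size=1)

    def forward(self, x):                              # x: (B, C, T, J)
        mx, _ = x.max(dim=1, keepdim=True)
        avg = x.mean(dim=1, keepdim=True)
        return torch.softmax(self.fuse(torch.cat([mx, avg], dim=1)), dim=2)

def spatio_temporal_gate(x, spatial: SpatialAttention, temporal: TemporalAttention):
    # Combine the two attention maps (element-wise product assumed for the
    # patent's "cross multiplication") and gate the input features.
    return x * (spatial(x) * temporal(x))
```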
By introducing the spatio-temporal attention mechanism, key points and key frames receive more attention during prediction, thereby improving the accuracy and stability of the motion prediction result.
S203: and inputting the multi-modal characteristics into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
The motion attitude prediction network can output a motion prediction result based on input multi-mode features, wherein the multi-mode features are feature fusion data of key points of change of target individuals under different scales, and accuracy of motion prediction can be improved.
In one possible implementation, the inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network includes:
inputting the local key point multi-modal features into the pre-constructed motion gesture prediction network to obtain the motion prediction result output by the pre-constructed motion gesture prediction network.
To predict the motion of the target individual more accurately, the local key point multi-modal features are used for prediction: they represent the motion changes of the target individual more completely, so prediction accuracy is improved.
Taking a human as an example, the structure of a human body can be characterized by key points such as the head, hands, chest and legs, and a single human body key point sequence contains only the position features of those key points. Motion prediction is performed based on the local key point multi-modal features; a hypothetical prediction head is sketched below.
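As a hypothetical stand-in for the motion gesture prediction network, the sketch below maps local key point multi-modal features to future key point coordinates with a small per-key-point MLP head; the patent does not disclose this architecture, so the layer choices and dimensions are assumptions:

```python
import torch
import torch.nn as nn

class PosePredictionHead(nn.Module):
    """Hypothetical prediction head: per key point, map the local multi-modal
    feature to `horizon` future 2D coordinates."""
    def __init__(self, feat_dim: int = 128, horizon: int = 10, coords: int = 2):
        super().__init__()
        self.horizon, self.coords = horizon, coords
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * coords),
        )

    def forward(self, feats):                          # feats: (B, J, feat_dim)
        b, j, _ = feats.shape
        return self.mlp(feats).reshape(b, j, self.horizon, self.coords)
```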
Based on the contents of steps S201 to S203: first, after the motion video to be processed is acquired, it is preprocessed to obtain motion video data to be processed, wherein the motion video to be processed includes a target individual whose motion gesture is to be predicted; second, feature extraction and fusion are performed on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features; finally, the multi-modal feature is input into a pre-constructed motion gesture prediction network to obtain the motion prediction result output by that network. In this way, the motion video data are processed in advance to extract the key point sequence features, the motion speed difference features and the motion light and shadow features of the target individual, and the three features are fused to obtain the multi-modal feature. The multi-modal feature carries little noise information and accurately reflects the characteristics of the target individual in motion, even in complex application scenarios and for medium- and long-term prediction, so performing motion gesture prediction based on the multi-modal feature can improve the prediction accuracy of the target individual's motion gesture.
The foregoing embodiments provide the motion gesture prediction method of the present application. A motion gesture prediction apparatus according to an embodiment of the present application, which performs the method shown in fig. 2, is described next. A schematic structural diagram of the apparatus is shown in fig. 7; it includes a preprocessing module 701, a feature extraction and fusion module 702, and a prediction module 703.
The preprocessing module 701 is configured to, in response to acquiring the motion video to be processed, preprocess it to obtain motion video data to be processed, wherein the motion video to be processed includes a target individual whose motion gesture is to be predicted;
the feature extraction and fusion module 702 is configured to perform feature extraction and fusion on the motion video data to be processed by using a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features;
and the prediction module 703 is configured to input the multi-modal feature to a pre-constructed motion gesture prediction network, so as to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
In one possible implementation, the feature extraction and fusion module 702 includes a feature extraction sub-module and a feature fusion sub-module:
the feature extraction submodule is used for extracting the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features from the motion video data to be processed by utilizing the pre-constructed multi-modal feature extraction and fusion network;
the feature fusion submodule is used for fusing the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features to obtain the multi-modal feature.
In one possible implementation, after obtaining the multi-modal feature, the apparatus further includes an encoding submodule and a decoding submodule:
the encoding submodule is used for inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features;
the decoding submodule is used for inputting the overall key point multi-modal features to a decoder layer to obtain the local key point multi-modal features.
In one possible implementation, the prediction module 703 is specifically configured to: input the local key point multi-modal features into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
The application provides a motion gesture prediction apparatus comprising a preprocessing module, a feature extraction and fusion module and a prediction module. The preprocessing module preprocesses the motion video to be processed after it is acquired to obtain motion video data to be processed, wherein the motion video to be processed comprises a target individual whose motion gesture is to be predicted; the feature extraction and fusion module performs feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features; and the prediction module inputs the multi-modal feature into a pre-constructed motion gesture prediction network to obtain the motion prediction result output by that network. In this way, the motion video data are processed in advance to extract the key point sequence features, the motion speed difference features and the motion light and shadow features of the target individual, and the three features are fused to obtain the multi-modal feature. The multi-modal feature carries little noise information and accurately reflects the characteristics of the target individual in motion, even in complex application scenarios and for medium- and long-term prediction, so performing motion gesture prediction based on the multi-modal feature can improve the prediction accuracy of the target individual's motion gesture.
Based on the method for predicting the motion gesture provided by the above method embodiment, the embodiment of the present application provides a device for predicting the motion gesture, referring to fig. 8, where the device includes: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the motion gesture prediction method according to any of the embodiments described above.
Based on the motion gesture prediction method provided by the above method embodiment, the embodiment of the application provides a computer readable storage medium, where the computer readable storage medium stores instructions, when the instructions are executed on a device, cause the device to execute the motion gesture prediction method described in any one of the above embodiments.
In this specification, each embodiment is described in a progressive manner; identical or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus and device embodiments are described relatively simply because they are substantially similar to the method embodiment; for relevant details, see the description of the method embodiment. The apparatus and device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement this without creative effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A motion gesture prediction method, the method comprising:
in response to acquiring a motion video to be processed, preprocessing the motion video to be processed to obtain motion video data to be processed, wherein the motion video to be processed comprises a target individual whose motion gesture is to be predicted;
performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features;
and inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
2. The motion gesture prediction method according to claim 1, wherein the performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature includes:
extracting the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features from the motion video data to be processed by utilizing the pre-constructed multi-modal feature extraction and fusion network;
and fusing the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features to obtain the multi-modal feature.
3. The motion gesture prediction method according to claim 1, wherein after obtaining the multi-modal feature, the method further comprises:
inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features;
and inputting the overall key point multi-modal features to a decoder layer to obtain local key point multi-modal features.
4. The motion gesture prediction method according to claim 3, wherein the inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network comprises:
inputting the local key point multi-modal features into the pre-constructed motion gesture prediction network to obtain the motion prediction result output by the pre-constructed motion gesture prediction network.
5. A motion gesture prediction apparatus, the apparatus comprising:
the preprocessing module is used for preprocessing the motion video to be processed in response to acquiring the motion video to be processed, to obtain motion video data to be processed, wherein the motion video to be processed comprises a target individual whose motion gesture is to be predicted;
the feature extraction and fusion module is used for performing feature extraction and fusion on the motion video data to be processed by utilizing a pre-constructed multi-modal feature extraction and fusion network to obtain a multi-modal feature, wherein the multi-modal feature is a fusion of the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features;
and the prediction module is used for inputting the multi-modal feature into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
6. The motion gesture prediction apparatus according to claim 5, wherein the feature extraction and fusion module includes a feature extraction sub-module and a feature fusion sub-module:
the feature extraction submodule is used for extracting the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features from the motion video data to be processed by utilizing the pre-constructed multi-modal feature extraction and fusion network;
the feature fusion submodule is used for fusing the key point sequence features of the target individual, the motion speed difference features of the target individual and the motion light and shadow features to obtain the multi-modal feature.
7. The motion gesture prediction apparatus according to claim 5, further comprising, after obtaining the multi-modal feature, an encoding submodule and a decoding submodule:
the encoding submodule is used for inputting the multi-modal feature to an encoder layer to obtain overall key point multi-modal features;
the decoding submodule is used for inputting the overall key point multi-modal features to a decoder layer to obtain the local key point multi-modal features.
8. The motion gesture prediction apparatus according to claim 7, wherein the prediction module is specifically configured to: input the local key point multi-modal features into a pre-constructed motion gesture prediction network to obtain a motion prediction result output by the pre-constructed motion gesture prediction network.
9. A motion gesture prediction apparatus, the apparatus comprising: a processor, memory, system bus;
the processor and the memory are connected through the system bus;
the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the motion gesture prediction method according to any one of claims 1 to 4.
10. A computer readable storage medium storing instructions that, when executed on a device, cause the device to perform the motion gesture prediction method according to any one of claims 1 to 4.
CN202311507146.0A (filed 2023-11-13): Motion gesture prediction method, device, equipment and computer readable storage medium. Status: pending (CN117392588A).

Priority Applications (1)

Application Number: CN202311507146.0A
Priority Date / Filing Date: 2023-11-13
Title: Motion gesture prediction method, device, equipment and computer readable storage medium

Publications (1)

Publication Number: CN117392588A
Publication Date: 2024-01-12
Family ID: 89439185
Country Status: CN, pending


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination