CN116843878A - Intelligent human motion planning system based on a conditional generative adversarial network


Info

Publication number
CN116843878A
Authority
CN
China
Prior art keywords
action
information
module
data
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310881850.6A
Other languages
Chinese (zh)
Inventor
张金柱
郭奇宙
史汉卿
徐航
丁铖龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taiyuan University of Technology
Original Assignee
Taiyuan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taiyuan University of Technology
Priority to CN202310881850.6A
Publication of CN116843878A
Legal status: Pending

Classifications

    • G06T 19/20 - Editing of 3D images, e.g. changing shapes or colours, aligning objects or positioning parts
    • G06F 3/04815 - Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G06F 3/04842 - Selection of displayed objects or displayed text elements
    • G06F 3/04847 - Interaction techniques to control parameter settings, e.g. interaction with sliders or dials
    • G06N 3/0475 - Generative networks
    • G06N 3/094 - Adversarial learning
    • G06V 10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V 10/776 - Validation; Performance evaluation
    • G06V 10/82 - Image or video recognition or understanding using neural networks
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Computer Graphics (AREA)
  • Architecture (AREA)
  • Computer Hardware Design (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the fields of artificial intelligence and robotics, and in particular to an intelligent human motion planning system based on a conditional generative adversarial network. The system comprises a data preprocessing module, a conditional generative adversarial network module and an action planning module. The conditional generative adversarial network module implicitly learns the latent distribution characteristics of human motion data and generates, from a motion description text given by the user, a specified motion plan that closely resembles real human motion. The system solves problems of existing humanoid motion planning techniques such as the high cost of motion teaching, cumbersome trajectory generation procedures and low motion acquisition efficiency, greatly enriches the diversity of robot motion planning, and provides a motion generation model for subsequent intelligent motion planning in various fields.

Description

Intelligent human motion planning system based on a conditional generative adversarial network
Technical Field
The invention relates to the fields of artificial intelligence and robotics, and in particular to an intelligent human motion planning system based on a conditional generative adversarial network (Conditional Generative Adversarial Network, cGAN).
Background
Humanoid motion planning for humanoid robots means that the robot imitates human movement patterns and actions in order to interact and cooperate with humans. The purpose is to make the humanoid robot more similar to a human in appearance, structure and function, thereby improving the comfort of human-computer interaction. Humanoid motion planning technology is widely used in entertainment, education, medical care, military and other fields. In the entertainment field, humanoid robots interact with players by taking on roles such as NPCs, and realistic humanoid motion planning gives players a better game experience; in the medical field, humanoid motion planning can replace rehabilitation physicians in completing rehabilitation training; in the military field, humanoid robots can take on important tasks such as detection, reconnaissance and rescue.
At present, traditional human motion planning has several implementation approaches. Motion planning based on kinematics and dynamics studies the basic information of human motion, including joint angles, positions, velocities and other motion-related physical quantities. Motion planning based on control theory and optimization algorithms enables online correction of robot motion and design of motion trajectories, where control theory studies control strategies that guarantee the accuracy and stability of the motion, and optimization algorithms search for optimal solutions. Motion planning based on motion capture technology captures human motion data with sensors or cameras and then reproduces real human motion on a computer. Trajectory planning based on human characteristic modeling models the skeletal muscles of the human body and simulates human motion by understanding its movement laws. Planning methods based on motion constraint models consider the natural constraints of human motion (joint ranges of motion, muscle load capacity, etc.) and the task objectives, constrain and adjust these factors reasonably, and thereby simulate human motion.
Traditional human motion planning relies on mathematical and physical modeling to design motion trajectories and therefore cannot simulate human motion realistically; manually designed motion trajectories limit the diversity of robot motions and are difficult to extend to more complex movements; motion trajectories and optimization parameters are determined through a large number of calculations constrained by prior information such as posture, velocity and mass, which results in enormous computational cost; teaching various human motions to a humanoid robot based on motion capture can achieve realistic simulation, but motion capture requires a great deal of time and effort and faces problems such as the high cost of motion teaching, cumbersome trajectory generation procedures and low motion acquisition efficiency.
In the field of deep learning, however, the generative adversarial network is a very effective framework for training generative models. It can implicitly learn the latent distribution characteristics of real data, and the trained generative model has a very strong data generation capability and has been widely applied and developed in the image, video and audio fields. The conditional generative adversarial network embeds a condition vector on top of the generative adversarial network, so that the synthesized data is controlled by the condition vector.
Disclosure of Invention
The purpose of the invention is to use a conditional generative adversarial network to implicitly learn the latent distribution characteristics of human motion data and to generate, from a motion description text given by the user, a specified motion plan that closely resembles real human motion.
The invention adopts the following technical scheme to achieve the aim:
the intelligent human body action planning system based on the conditional generation countermeasure network comprises a data preprocessing module, a conditional generation countermeasure network module and an action planning module, wherein the data preprocessing module comprises a sentence vector coding module, an action information processing module and an action classification description module, preprocesses a text description file, an acquisition action capture c3d file and a classification description file respectively, and finally outputs a human body action data set hdf5 file;
the conditional generation countermeasure network module comprises a data set sample retriever, a conditional generation countermeasure network model and a model trainer; the data set sample retriever retrieves a batch of sample data in the human action data set hdf5 file as batch processing samples; the model trainer receives the batch processing samples output by the data set sample retriever and the conditional generation countermeasure network model, further trains the conditional generation countermeasure network model in the model trainer, continuously adjusts network parameters in the training process to improve the performance and accuracy of the model, and generates text corresponding action information by utilizing the conditional generation countermeasure network model after training is finished;
The action planning module comprises a generated-data post-processing model, a virtual humanoid robot model and an action visualization interface. The generated-data post-processing model fits the action information corresponding to the text into continuous joint action curves, which are input into the virtual humanoid robot model and the action visualization interface respectively; the virtual humanoid robot model receives the continuous joint action curves and generates the robot action, which is input into the action visualization interface; the action visualization interface displays the continuous joint action curves and the robot action to the user, and also feeds the action description text given by the user into the conditional generative adversarial network model.
Further, the sentence vector encoding module encodes the input text description file into a fixed-length sentence vector representation through a pre-trained BERT sentence vector encoding model, representing the semantic information of the text; the action information processing module preprocesses the input motion-capture c3d file to obtain the trajectory information of the limb end points and the angle information of the joint points of the human body model, combines the complete trajectory information and angle information, and splits them into x, y and z channels for storing the human motion data; the action classification description module classifies the classification description files, which serve as the basis for distinguishing different actions.
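For illustration, the sentence vector encoding step could be sketched as follows with the Hugging Face transformers library. The checkpoint name bert-base-chinese, the mean pooling over token embeddings and the 768-dimensional output are assumptions; the patent only specifies a pre-trained BERT sentence vector encoding model.

```python
# Minimal sketch of the sentence vector encoding module.
# Assumptions: checkpoint name and mean pooling; the patent only requires
# a pre-trained BERT sentence encoder producing a fixed-length vector.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
encoder = AutoModel.from_pretrained("bert-base-chinese")

def encode_sentence(text: str) -> torch.Tensor:
    """Encode an action description text into a fixed-length sentence vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=64)
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the token embeddings into one 768-dimensional sentence vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

embedding = encode_sentence("A person raises the right arm and waves.")
print(embedding.shape)  # torch.Size([768])
```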
Further, the human motion dataset hdf5 file includes action embedding information, action data information, an action file name, an action text description and an action classification description.
Further, the data structure of a batch of samples consists of correct action information, correct embedding information, wrong action information, wrong embedding information and action text information.
Further, the conditional generative adversarial network model comprises a discriminator and a generator. The generator generates the action information corresponding to a text from the input action description text, and the discriminator evaluates the authenticity of that action information;
The discriminator comprises a first discrimination module, a condition vector embedding module and a second discrimination module. The input of the first discrimination module is action information; it passes through one 3×3 convolution layer, a spectral normalization layer, a LeakyReLU activation function and 5 discriminator residual blocks, which add data channels and pre-learn the features of the action data before the condition vector is embedded. The condition vector comes from either the correct embedding information or the wrong embedding information in the batch of samples, and its content is determined by the training procedure of the model trainer. The inputs of the condition vector embedding module are the output of the first discrimination module and the condition vector: the output of the first discrimination module is convolved, the condition vector passes through a fully connected layer, a spectral normalization layer and a LeakyReLU activation function, and finally the convolution result and the processed condition vector are concatenated. The input of the second discrimination module is the output of the condition vector embedding module; through 7 discriminator residual blocks, 1 global average pooling layer, 1 fully connected layer and 1 spectral normalization layer, the input is finally mapped to a scalar, integrating the feature learning of the action information and the condition vector;
The generator comprises an input encoding module, a first generation module, a second generation module and a third generation module. The input encoding module receives the action description text given by the user, obtains a one-dimensional sentence vector of a specified dimension through the BERT sentence vector encoding model, and passes it through 1 fully connected layer, 1 batch normalization layer and 1 ReLU activation function to represent the semantic information of the text and produce the feature expression of the sentence vector. The feature expression of the sentence vector is concatenated with a random noise vector and fed into the first generation module; after 1 fully connected layer, 1 batch normalization layer and 1 ReLU activation function, the tensor is interpolated to a specified size, producing the sentence vector and random noise feature expression. The second generation module receives the sentence vector and random noise feature expression, stacks several groups of generator residual blocks and 2× upsampling layers to generate the features of the action data, and then adjusts them to the specified size through an action-size upsampling layer, producing the feature expression of the action information. The third generation module receives the feature expression of the action information and finally generates the action information corresponding to the text through 1 batch normalization layer, 1 ReLU activation function, 1 3×3 convolution layer and 1 Tanh activation function.
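The generator structure just described could be sketched as follows in PyTorch. The layer sizes (noise dimension, sentence vector dimension, number of joints and frames) are illustrative assumptions, and the residual block and upsampling stack is reduced to a plain upsample/convolution pair for brevity; this is a sketch, not the patented implementation.

```python
# Sketch of the conditional generator: sentence vector + noise -> action tensor.
# Output layout is (batch, 3 channels, frames, joints); all sizes are assumptions.
import torch
import torch.nn as nn

class ConditionalGenerator(nn.Module):
    def __init__(self, embed_dim=768, noise_dim=128, joints=24, frames=64):
        super().__init__()
        self.frames, self.joints = frames, joints
        # Input encoding module: project the BERT sentence vector.
        self.encode = nn.Sequential(
            nn.Linear(embed_dim, 256), nn.BatchNorm1d(256), nn.ReLU())
        # First generation module: fuse sentence features with random noise.
        self.fc = nn.Sequential(
            nn.Linear(256 + noise_dim, 128 * (frames // 4) * (joints // 4)),
            nn.BatchNorm1d(128 * (frames // 4) * (joints // 4)), nn.ReLU())
        # Second / third generation modules: upsample to the action size and
        # map to 3 channels (x, y, z) with a Tanh output in [-1, 1].
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(128, 64, 3, padding=1),
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.Upsample(scale_factor=2), nn.Conv2d(64, 32, 3, padding=1),
            nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Tanh())

    def forward(self, sentence_vec, noise):
        cond = self.encode(sentence_vec)
        x = self.fc(torch.cat([cond, noise], dim=1))
        x = x.view(-1, 128, self.frames // 4, self.joints // 4)
        return self.decode(x)

gen = ConditionalGenerator()
fake = gen(torch.randn(2, 768), torch.randn(2, 128))
print(fake.shape)  # torch.Size([2, 3, 64, 24])
```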
Further, the first discrimination module and the second discrimination module contain discriminator residual blocks of the same structure, which effectively alleviate the vanishing-gradient and exploding-gradient problems in deep neural network training.
If the discriminator residual block is the first residual block, the tensor input to the discriminator residual block passes through two groups of LeakyReLU activation function, 3×3 convolution layer and spectral normalization, while the input tensor is copied directly; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a LeakyReLU activation is applied to obtain the final output tensor. If the discriminator residual block is not the first residual block, the tensor input to the discriminator residual block passes through two groups of LeakyReLU activation function, 3×3 convolution layer and spectral normalization, while the input tensor has its feature size adjusted by a 1×1 convolution layer and spectral normalization; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a LeakyReLU activation is applied to obtain the final output tensor.
Further, if the generator residual block is the first residual block, the tensor input to the generator residual block passes through two groups of 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor is copied directly; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a ReLU activation is applied to obtain the final output tensor. If the generator residual block is not the first residual block, the tensor input to the generator residual block passes through two groups of 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor has its feature size adjusted by a 1×1 convolution layer and batch normalization; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a ReLU activation is applied to obtain the final output tensor.
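A minimal sketch of the two residual block variants follows, using PyTorch's spectral_norm for the discriminator block and batch normalization for the generator block. The channel counts are assumptions, and the skip path here switches on the channel count rather than on whether the block is literally the first one, which is a simplification of the description above.

```python
# Sketch of the discriminator and generator residual blocks.
# Identity skip when channel counts match (first-block case); otherwise a
# 1x1 convolution with the corresponding normalization adjusts the skip path.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DiscResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.LeakyReLU(0.2), spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1)),
            nn.LeakyReLU(0.2), spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1)))
        self.skip = (nn.Identity() if in_ch == out_ch
                     else spectral_norm(nn.Conv2d(in_ch, out_ch, 1)))
        self.out_act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.out_act(self.body(x) + self.skip(x))

class GenResidualBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch))
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                        nn.BatchNorm2d(out_ch)))
        self.out_act = nn.ReLU()

    def forward(self, x):
        return self.out_act(self.body(x) + self.skip(x))

x = torch.randn(2, 32, 16, 6)
print(DiscResidualBlock(32, 64)(x).shape, GenResidualBlock(32, 32)(x).shape)
```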
Further, the objective function of the conditional generative adversarial network model is:
min_G max_D V(D,G) = E_{x~P_data(x)}[log D(x|y)] + E_{z~P_z(z)}[log(1-D(G(z|y)))]
where D denotes the discriminator, G denotes the generator, V denotes the objective function, E denotes the expectation, x~P_data(x) denotes the action data information obtained from the human motion dataset hdf5 file, z~P_z(z) denotes random noise sampled from a normal distribution, log D(x|y) denotes the log score of the action data information x in the discriminator D under the constraint of the condition vector y, and log(1-D(G(z|y))) denotes the log score in the discriminator D of the action information generated by the generator G with the random noise z as input, under the constraint of the condition vector y. max_D indicates that the discriminator D wants to maximize the objective function, i.e. D(x|y) should increase and D(G(z|y)) should decrease; min_G indicates that the generator G wants to minimize the objective function, i.e. D(G(z|y)) should increase. The discriminator D and the generator G thus form an adversarial relationship.
Further, the training of the conditional generative adversarial network model in the model trainer comprises acquiring sample data, training the discriminator with the generator fixed, and training the generator with the discriminator fixed;
Acquiring sample data obtains the correct action information, the correct embedding information and the wrong action information from a batch of samples, creates real labels and fake labels at the same time, and applies label smoothing to the real labels;
Training the discriminator with the generator fixed: first, the correct action information and the correct embedding information are input into the discriminator and the loss real_loss of the real data is computed, with the smoothed real labels as targets; then random noise and the correct embedding information are input into the generator, the generated action information and the correct embedding information are input into the discriminator, and the loss fake_loss of the generated data is computed, with the fake labels as targets; next, the wrong action information and the correct embedding information are input into the discriminator and the loss wrong_loss of the wrong data is computed, with the fake labels as targets; finally, the loss function of the discriminator is computed;
Training the generator with the discriminator fixed: first, random noise and the correct embedding information are input into the generator, then the generated action information and the correct embedding information are input into the discriminator, the loss G_loss of the generated data is computed with the real labels as targets, and finally the loss function of the generator is computed.
Further, the loss function of the discriminator is computed as follows:
D_loss=real_loss+fake_loss+wrong_loss
where D_loss is the overall loss function; real_loss is the loss on real data and characterizes the discriminator's ability to recognize real data; fake_loss is the loss on generated data and characterizes the discriminator's ability to recognize generated data; wrong_loss is the loss on wrong data and characterizes the discriminator's ability to recognize wrong data;
The loss function of the generator is computed as follows:
G_loss=-[y·log(D(G(z)))+(1-y)·log(1-D(G(z)))]
where y denotes the real label and D(G(z)) denotes the discriminator's judgement of the action information corresponding to the text. The objective of this loss function is to minimize the difference between the generated samples and the real samples, thereby improving the generation capability of the generator.
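For illustration, one training iteration following the scheme above (fix G and train D on real, generated and wrong samples; then fix D and train G) could be sketched as follows with binary cross-entropy losses. The generator and discriminator interfaces, the label-smoothing value 0.9 and the use of fake labels for the wrong pairs are assumptions; D(action, embedding) is assumed to output a probability in (0, 1).

```python
# Sketch of one training iteration of the conditional GAN described above.
# G, D, opt_G and opt_D are assumed to be a generator, discriminator and their
# optimizers; the smoothing value 0.9 and tensor layouts are assumptions.
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, real_action, real_embed, wrong_action, noise_dim=128):
    batch = real_action.size(0)
    real_label = torch.full((batch, 1), 0.9)   # label smoothing on the real labels
    fake_label = torch.zeros(batch, 1)

    # ---- Fix the generator, train the discriminator ----
    opt_D.zero_grad()
    real_loss = F.binary_cross_entropy(D(real_action, real_embed), real_label)
    fake_action = G(real_embed, torch.randn(batch, noise_dim)).detach()
    fake_loss = F.binary_cross_entropy(D(fake_action, real_embed), fake_label)
    wrong_loss = F.binary_cross_entropy(D(wrong_action, real_embed), fake_label)
    d_loss = real_loss + fake_loss + wrong_loss   # D_loss from the description
    d_loss.backward()
    opt_D.step()

    # ---- Fix the discriminator, train the generator ----
    opt_G.zero_grad()
    fake_action = G(real_embed, torch.randn(batch, noise_dim))
    g_loss = F.binary_cross_entropy(D(fake_action, real_embed),
                                    torch.ones(batch, 1))  # G_loss with real labels
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```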
Compared with the prior art, the invention has the following advantages:
Aiming at the problems of existing humanoid robot motion planning methods, the data preprocessing module adjusts the structure of the human motion data so that it serves as prior information for model training, which facilitates feature extraction from the human motion data; the conditional generative adversarial network module effectively extracts the characteristics of human motion from the data dimension, so that human actions can be simulated realistically; according to the input action description text, the module generates a realistic action plan that meets the requirements, reduces the workload of humanoid action trajectory planning, and improves the diversity of humanoid action trajectory planning; the model does not need to establish a humanoid evaluation index or a cost function to optimize the robot motion trajectory, which improves the generalization capability of motion planning. The action planning module provides an action visualization interface that displays multi-dimensional information of the human action planning result, including the generated action data and the visualization of the virtual model's actions, and reduces the economic cost of humanoid action trajectory planning.
The intelligent human motion planning system can solve problems of existing humanoid motion planning techniques such as the high cost of motion teaching, cumbersome trajectory generation procedures and low motion acquisition efficiency, greatly enriches the diversity of robot motion planning, and provides a motion generation model for subsequent intelligent motion planning in various fields.
Drawings
FIG. 1 is a schematic diagram of the joints of the human body model according to the present invention;
FIG. 2 is a schematic diagram of an action sample data structure according to the present invention;
FIG. 3 is a schematic flow chart of a data preprocessing module according to the present invention;
FIG. 4 is a schematic diagram of a batch sample data structure according to the present invention;
FIG. 5 is a schematic diagram of a conditional generation countermeasure network model of the present invention;
FIG. 6 is a schematic diagram of a model trainer training process in accordance with the present invention;
FIG. 7 is a schematic diagram of an action visualization interface of the present invention;
fig. 8 is a schematic diagram of the intelligent human motion planning system according to the present invention.
Detailed Description
In order to further illustrate the technical scheme of the invention, the invention is further illustrated by the following examples.
The generative adversarial network in the prior art is mainly used for data generation in the image, video and text fields and cannot be directly used to generate action data that carries temporal information. However, the human body model built on the basis of kinematics theory has three-dimensional joint rotation information (as shown in fig. 1), so the time axis can be used as the first dimension of the data, the joint type as the second dimension, and the three axes of each joint as the third dimension (the number of channels); one action data sample is therefore a three-dimensional tensor (as shown in fig. 2).
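For instance, with an assumed skeleton of 24 joints and a 3-second motion sampled at 30 frames per second, one action sample would be stored as the following tensor; all sizes here are purely illustrative.

```python
# Illustrative action sample tensor: time x joint x axis (sizes are assumptions).
import numpy as np

frames, joints, axes = 90, 24, 3           # 3 s at 30 fps, 24 joints, x/y/z axes
action_sample = np.zeros((frames, joints, axes), dtype=np.float32)
action_sample[0, 5, :] = [0.12, -0.34, 1.57]   # e.g. rotation of joint 5 at frame 0
print(action_sample.shape)                      # (90, 24, 3)
```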
As shown in fig. 8, the intelligent human motion planning system based on a conditional generative adversarial network of this embodiment includes a data preprocessing module 1, a conditional generative adversarial network module 2 and an action planning module 3.
As shown in fig. 3, the data preprocessing module 1 includes a sentence vector encoding module 1.1, an action information processing module 1.2 and an action classification description module 1.3, which preprocess a text description file, a captured motion-capture c3d file and a classification description file respectively, and finally outputs a human motion dataset hdf5 file 1.4, which includes action embedding information 1.4.1, action data information 1.4.2, an action file name 1.4.3, an action text description 1.4.4 and an action classification description 1.4.5.
The sentence vector encoding module 1.1 encodes the input text description file into a fixed-length sentence vector representation through a pre-trained BERT sentence vector encoding model, representing the semantic information of the text; this module outputs the action embedding information 1.4.1 and the action text description 1.4.4 of the human motion dataset hdf5 file 1.4. The action information processing module 1.2 preprocesses the input motion-capture c3d file to obtain the trajectory information of the limb end points and the angle information of the joint points of the human body model, and checks whether each piece of information is complete; if it is incomplete, the data is discarded and the next piece of data is parsed; if it is complete, the complete trajectory information and angle information are combined and split into x, y and z channels for storing the human motion data, and this module outputs the action data information 1.4.2 of the human motion dataset hdf5 file 1.4. The action classification description module 1.3 classifies the classification description files, which serve as the basis for distinguishing different actions, and its output is the action classification description 1.4.5 of the human motion dataset hdf5 file 1.4.
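A minimal sketch of how the five fields of the human motion dataset could be written to an hdf5 file with h5py follows; the group layout, field names and sizes are assumptions mirroring items 1.4.1 to 1.4.5, not the patented file format.

```python
# Sketch: writing one action sample into the human motion dataset hdf5 file.
# Group and field names are assumptions mirroring items 1.4.1-1.4.5.
import h5py
import numpy as np

with h5py.File("human_motion_dataset.hdf5", "w") as f:
    grp = f.create_group("walk_forward_001")                        # action file name (1.4.3)
    grp.create_dataset("action_embedding",
                       data=np.random.randn(768).astype(np.float32))          # 1.4.1
    grp.create_dataset("action_data",
                       data=np.zeros((90, 24, 3), dtype=np.float32))          # 1.4.2
    grp.attrs["text_description"] = "A person walks forward at a steady pace."  # 1.4.4
    grp.attrs["classification"] = "walk"                                        # 1.4.5
```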
The conditional generative adversarial network module 2 comprises a dataset sample retriever 2.1, a conditional generative adversarial network model 2.2 and a model trainer 2.3. First, a batch of sample data is retrieved from the human motion dataset hdf5 file 1.4 by the dataset sample retriever 2.1 as a batch of samples 2.1.6. Because the numbers of time frames of the captured motions are not consistent, the dataset sample retriever 2.1 pads each batch of motion-capture data. The invention preferably pads as follows: the value of the last time step is repeated until the motion sequence is filled to the maximum length within the batch, and the padded batch of samples is assembled along the time dimension into a four-dimensional tensor, as sketched below.
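The following sketch illustrates this padding strategy; the (frames, joints, axes) tensor layout matches fig. 2, while the concrete sizes are assumptions.

```python
# Pad each motion sequence (time x joint x axis) to the longest length in the
# batch by repeating its last time step, then stack into a 4-D batch tensor.
import numpy as np

def pad_batch(sequences):
    max_len = max(seq.shape[0] for seq in sequences)
    padded = []
    for seq in sequences:
        tail = np.repeat(seq[-1:], max_len - seq.shape[0], axis=0)  # repeat last frame
        padded.append(np.concatenate([seq, tail], axis=0))
    return np.stack(padded)   # shape: (batch, max_len, joints, 3)

batch = pad_batch([np.random.randn(60, 24, 3), np.random.randn(90, 24, 3)])
print(batch.shape)  # (2, 90, 24, 3)
```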
As shown in fig. 4, the data structure of the batch of samples 2.1.6 consists of correct action information 2.1.1, correct embedding information 2.1.2, wrong action information 2.1.3, wrong embedding information 2.1.4 and action text information 2.1.5. Taking the correct action information 2.1.1 as the reference, its action embedding information 1.4.1 is used as the correct embedding information 2.1.2 and its action text description 1.4.4 as the action text information 2.1.5, while the action embedding information of another sample is randomly selected from the human motion dataset hdf5 file 1.4 as the wrong embedding information 2.1.4. It should be noted in particular that the wrong action information 2.1.3 is selected as follows: using the action classification description as the selection condition, action data under a different action classification description is chosen at random, and its action data information is taken as the wrong action information 2.1.3 (a sketch follows below).
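This negative-sample selection could be sketched as follows; the record layout reuses the assumed hdf5-style fields from above.

```python
# Sketch: pick wrong action information 2.1.3 from a sample whose action
# classification differs from that of the correct sample (record layout assumed).
import random

def pick_wrong_action(dataset, correct_name):
    correct_class = dataset[correct_name]["classification"]
    candidates = [name for name, rec in dataset.items()
                  if rec["classification"] != correct_class]
    return dataset[random.choice(candidates)]["action_data"]

dataset = {
    "walk_forward_001": {"classification": "walk", "action_data": "walk tensor"},
    "wave_right_003":   {"classification": "wave", "action_data": "wave tensor"},
}
print(pick_wrong_action(dataset, "walk_forward_001"))  # wave tensor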
The batch of samples 2.1.6 is input into the conditional generative adversarial network model 2.2. As shown in fig. 5, the network model includes a discriminator 2.2.1 and a generator 2.2.2; the generator 2.2.2 generates the action information corresponding to a text from the input action description text, and the discriminator 2.2.1 evaluates the authenticity of that action information;
The discriminator 2.2.1 comprises a first discrimination module 2.2.1.1, a condition vector embedding module 2.2.1.2 and a second discrimination module 2.2.1.3. The input of the first discrimination module 2.2.1.1 is action information (the action information comes from the correct action information 2.1.1, the wrong action information 2.1.3 or the text-corresponding action information in the batch of samples 2.1.6; its content is determined by the training procedure of the model trainer 2.3); it then passes through one 3×3 convolution layer, a spectral normalization layer, a LeakyReLU activation function and 5 discriminator residual blocks 2.2.1.4, which add data channels and pre-learn the features of the action data before the condition vector is embedded. The condition vector comes from either the correct embedding information 2.1.2 or the wrong embedding information 2.1.4 in the batch of samples 2.1.6, and its content is determined by the training procedure of the model trainer 2.3. The inputs of the condition vector embedding module 2.2.1.2 are the output of the first discrimination module 2.2.1.1 and the condition vector: the output of the first discrimination module 2.2.1.1 is convolved, the condition vector passes through a fully connected layer, a spectral normalization layer and a LeakyReLU activation function, and finally the convolution result and the processed condition vector are concatenated. The input of the second discrimination module 2.2.1.3 is the output of the condition vector embedding module 2.2.1.2; through 7 discriminator residual blocks 2.2.1.4, 1 global average pooling layer, 1 fully connected layer and 1 spectral normalization layer, the input is finally mapped to a scalar, integrating the feature learning of the action information and the condition vector (a minimal sketch follows below);
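The condition vector embedding step of the discriminator could be sketched as follows: project the condition vector, broadcast it over the pre-learned action feature map, concatenate and map the result to a scalar. The channel sizes, the spatial broadcasting of the condition vector and the reduced number of convolution stages are assumptions standing in for the residual stacks described above.

```python
# Sketch of the conditional discriminator with condition vector embedding.
# Channel sizes are illustrative; residual stacks are reduced to plain convs.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

class ConditionalDiscriminator(nn.Module):
    def __init__(self, embed_dim=768, feat_ch=64, cond_ch=16):
        super().__init__()
        # First discrimination module (reduced): pre-learn action features.
        self.features = nn.Sequential(
            spectral_norm(nn.Conv2d(3, feat_ch, 3, padding=1)), nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(feat_ch, feat_ch, 3, padding=1)), nn.LeakyReLU(0.2))
        # Condition vector embedding module: fully connected + spectral norm + LeakyReLU.
        self.embed_cond = nn.Sequential(
            spectral_norm(nn.Linear(embed_dim, cond_ch)), nn.LeakyReLU(0.2))
        # Second discrimination module (reduced): map fused features to a scalar.
        self.judge = nn.Sequential(
            spectral_norm(nn.Conv2d(feat_ch + cond_ch, feat_ch, 3, padding=1)),
            nn.LeakyReLU(0.2), nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            spectral_norm(nn.Linear(feat_ch, 1)), nn.Sigmoid())

    def forward(self, action, condition):
        feat = self.features(action)                       # (B, feat_ch, T, J)
        cond = self.embed_cond(condition)                  # (B, cond_ch)
        cond = cond[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return self.judge(torch.cat([feat, cond], dim=1))  # probability in (0, 1)

disc = ConditionalDiscriminator()
score = disc(torch.randn(2, 3, 64, 24), torch.randn(2, 768))
print(score.shape)  # torch.Size([2, 1])
```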
The generator 2.2.2 comprises an input encoding module 2.2.2.1, a first generation module 2.2.2.2, a second generation module 2.2.2.3 and a third generation module 2.2.2.4. The input encoding module 2.2.2.1 receives the action description text given by the user, obtains a one-dimensional sentence vector of a specified dimension through the BERT sentence vector encoding model, and passes it through 1 fully connected layer, 1 batch normalization layer and 1 ReLU activation function to represent the semantic information of the text and produce the feature expression of the sentence vector. The feature expression of the sentence vector is concatenated with a random noise vector and fed into the first generation module 2.2.2.2; after 1 fully connected layer, 1 batch normalization layer and 1 ReLU activation function, the tensor is interpolated to a specified size, producing the sentence vector and random noise feature expression. The second generation module 2.2.2.3 receives the sentence vector and random noise feature expression, stacks several groups of generator residual blocks 2.2.2.5 and 2× upsampling layers to generate the features of the action data, and then adjusts them to the specified size through an action-size upsampling layer, producing the feature expression of the action information. The third generation module 2.2.2.4 receives the feature expression of the action information and finally generates the action information corresponding to the text through 1 batch normalization layer, 1 ReLU activation function, 1 3×3 convolution layer and 1 Tanh activation function.
The first discrimination module 2.2.1.1 and the second discrimination module 2.2.1.3 contain discriminator residual blocks 2.2.1.4 of the same structure;
If the discriminator residual block 2.2.1.4 is the first residual block, the tensor input to the discriminator residual block 2.2.1.4 passes through two groups of LeakyReLU activation function, 3×3 convolution layer and spectral normalization, while the input tensor is copied directly; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a LeakyReLU activation is applied to obtain the final output tensor. If the discriminator residual block 2.2.1.4 is not the first residual block, the tensor input to the discriminator residual block 2.2.1.4 passes through two groups of LeakyReLU activation function, 3×3 convolution layer and spectral normalization, while the input tensor has its feature size adjusted by a 1×1 convolution layer and spectral normalization; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a LeakyReLU activation is applied to obtain the final output tensor.
If the generator residual block 2.2.2.5 is the first residual block, the tensor input to the generator residual block 2.2.2.5 passes through two groups of 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor is copied directly; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a ReLU activation is applied to obtain the final output tensor. If the generator residual block 2.2.2.5 is not the first residual block, the tensor input to the generator residual block 2.2.2.5 passes through two groups of 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor has its feature size adjusted by a 1×1 convolution layer and batch normalization; the two tensors are then added to obtain a tensor with the same dimensions as the former, and a ReLU activation is applied to obtain the final output tensor.
It should be noted that, according to the different positions and roles of the discriminator and the generator in the generative adversarial network, the invention preferably provides that:
In the discriminator 2.2.1, all normalization operations use spectral normalization and all activation functions are LeakyReLU activation functions. Spectral normalization effectively controls the magnitude of the weights and thus reduces the risk of overfitting. The LeakyReLU activation function is used to avoid the neuron "death" phenomenon of the ReLU activation function in the negative interval, thereby improving the nonlinear expression capacity of the model.
In the generator 2.2.2, all normalization operations are batch normalization, the Tanh activation function is used at the final output of the generator 2.2.2, and all other activation functions are ReLU activation functions. Batch normalization speeds up the convergence of the neural network and reduces the risk of vanishing gradients. The ReLU activation function allows the neural network to converge faster and avoids the vanishing-gradient problem during back-propagation; in addition, it is faster to compute than other activation functions and is one of the most widely used activation functions at present. The Tanh activation function constrains the generated action information to the range [-1, 1] for subsequent processing in the generated-data post-processing model 3.1; moreover, Tanh is nonlinear, fits the characteristics of real data better, and helps increase the expression capacity of the model.
The objective function of the conditional generative adversarial network model 2.2 is:
min_G max_D V(D,G) = E_{x~P_data(x)}[log D(x|y)] + E_{z~P_z(z)}[log(1-D(G(z|y)))]
where D denotes the discriminator, G denotes the generator, V denotes the objective function, E denotes the expectation, x~P_data(x) denotes the action data information 1.4.2 obtained from the human motion dataset hdf5 file 1.4, z~P_z(z) denotes random noise sampled from a normal distribution, log D(x|y) denotes the log score of the action data information x in the discriminator D under the constraint of the condition vector y, and log(1-D(G(z|y))) denotes the log score in the discriminator D of the action information generated by the generator G with the random noise z as input, under the constraint of the condition vector y. max_D indicates that the discriminator D wants to maximize the objective function, i.e. D(x|y) should increase and D(G(z|y)) should decrease; min_G indicates that the generator G wants to minimize the objective function, i.e. D(G(z|y)) should increase. The discriminator D and the generator G thus form an adversarial relationship.
The model trainer 2.3 receives the conditional generative adversarial network model 2.2 and the batch of samples 2.1.6 output by the dataset sample retriever 2.1, and trains the conditional generative adversarial network model 2.2 inside the model trainer 2.3. During training, the network parameters are adjusted continuously to improve the performance and accuracy of the model, and the generator and the discriminator gradually learn the latent variables of the dataset. Training consists of acquiring sample data 2.3.1, training the discriminator with the generator fixed 2.3.2 and training the generator with the discriminator fixed 2.3.3, repeated for a number of epochs (as shown in fig. 6); after training is finished, the conditional generative adversarial network model 2.2 is used to generate the action information corresponding to a text.
First, sample data are acquired 2.3.1: the correct action information 2.1.1, the correct embedding information 2.1.2 and the wrong action information 2.1.3 are obtained from the batch of samples 2.1.6, real labels and fake labels are created at the same time, and label smoothing is applied to the real labels so that the discriminator 2.2.1 does not output its results with excessive confidence;
Second, the discriminator is trained with the generator fixed 2.3.2: the correct action information 2.1.1 and the correct embedding information 2.1.2 are first input into the discriminator 2.2.1 and the loss real_loss of the real data is computed, with the smoothed real labels as targets; then random noise and the correct embedding information 2.1.2 are input into the generator 2.2.2, the generated action information and the correct embedding information 2.1.2 are input into the discriminator 2.2.1, and the loss fake_loss of the generated data is computed, with the fake labels as targets; next, the wrong action information 2.1.3 and the correct embedding information 2.1.2 are input into the discriminator 2.2.1 and the loss wrong_loss of the wrong data is computed; finally, the loss function of the discriminator is computed as follows:
D_loss=real_loss+fake_loss+wrong_loss
where D_loss is the overall loss function; real_loss is the loss on real data and characterizes the discriminator's ability to recognize real data; fake_loss is the loss on generated data and characterizes the discriminator's ability to recognize generated data; wrong_loss is the loss on wrong data and characterizes the discriminator's ability to recognize wrong data. All three are binary cross-entropy loss functions and are used to train the discriminator's ability to distinguish real data from generated samples; finally the gradients and the learning rate of the discriminator 2.2.1 are updated and recorded;
Third, the generator is trained with the discriminator fixed 2.3.3: random noise and the correct embedding information 2.1.2 are first input into the generator 2.2.2, then the generated action information and the correct embedding information 2.1.2 are input into the discriminator 2.2.1, the loss G_loss of the generated data is computed with the real labels as targets, and the loss function of the generator is computed as follows:
G_loss=-[y·log(D(G(z)))+(1-y)·log(1-D(G(z)))]
where y denotes the real label and D(G(z)) denotes the discriminator's judgement of the action information corresponding to the text; finally the gradients and the learning rate of the generator 2.2.2 are updated and recorded.
The action planning module 3 comprises a generated-data post-processing model 3.1, a virtual humanoid robot model 3.2 and an action visualization interface 3.3. The generated-data post-processing model 3.1 fits the action information corresponding to the text into continuous joint action curves, which are input into the virtual humanoid robot model 3.2 and the action visualization interface 3.3 respectively; the virtual humanoid robot model 3.2 receives the continuous joint action curves and generates the robot action, which is input into the action visualization interface 3.3; the action visualization interface 3.3 displays the continuous joint action curves and the robot action to the user, and also feeds the action description text given by the user into the conditional generative adversarial network model 2.2.
After model training is completed in the model trainer 2.3, the model can be used to generate the action information corresponding to a text. At this point the generated action information is discrete data, and certain post-processing steps are performed in the generated-data post-processing model 3.1 so that the action information can be input into the virtual humanoid robot model 3.2 and the action visualization interface 3.3. The preferred discrete-data fitting method of the invention is based on spline interpolation, which captures nonlinear dynamic changes well and produces smoother and more natural motion trajectories. The method fits the discrete data by constructing a smooth interpolation curve over a given set of control points to form continuous joint action curves, which are input into the virtual humanoid robot model 3.2 and the action visualization interface 3.3 respectively (a sketch follows below);
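A minimal sketch of such spline-based post-processing follows, fitting every joint channel of the discrete generated frames with scipy's CubicSpline and resampling it as a continuous joint action curve; the durations and sampling rates are assumptions.

```python
# Sketch: fit the discrete generated action frames with cubic splines and
# resample them into continuous joint action curves (durations assumed).
import numpy as np
from scipy.interpolate import CubicSpline

generated = np.random.randn(64, 24, 3)             # (frames, joints, axes) from the generator
t_in = np.linspace(0.0, 2.0, generated.shape[0])   # generated frames spread over 2 s
t_out = np.linspace(0.0, 2.0, 200)                 # denser, continuous-looking timeline

spline = CubicSpline(t_in, generated, axis=0)      # one spline per joint/axis channel
joint_curves = spline(t_out)                       # (200, 24, 3) smooth joint trajectories
print(joint_curves.shape)
```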
The virtual humanoid robot model 3.2 receives the continuous joint action curves, generates the robot action, and inputs the robot action into the action visualization interface 3.3. The action visualization interface 3.3 is an interactive application (as shown in fig. 7) that provides a convenient way for the user to input instructions and control the actions of the virtual humanoid robot. When the user inputs an action description text, the action visualization interface 3.3 parses the text, calls the trained conditional generative adversarial network model 2.2 to generate the action information corresponding to the text, processes the generated action information through the generated-data post-processing model 3.1, and passes it to the virtual humanoid robot model 3.2, in which the humanoid robot moves and operates according to the given action information. At the same time, the action visualization interface 3.3 also provides an interactive way to display the result of intelligent human motion planning: the interface can display the joint actions, velocities and trajectories of the virtual humanoid robot at different points in time, as well as related physical characteristics such as force, momentum and angular momentum, so that the planning result is presented to the user and the motion characteristics and performance of the virtual humanoid robot are better understood.
While the principal features and advantages of the present invention have been shown and described, it will be apparent to those skilled in the art that the invention is not limited to the details of the foregoing exemplary embodiments, but that the invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although the present description is written in terms of embodiments, not every embodiment contains only a single independent technical solution; this manner of description is adopted only for clarity, and the description should be taken as a whole. The technical solutions of the various embodiments may also be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (10)

1. An intelligent human motion planning system based on a conditional generative adversarial network, characterized in that: the system comprises a data preprocessing module (1), a conditional generative adversarial network module (2) and an action planning module (3);

the data preprocessing module (1) comprises a sentence vector encoding module (1.1), an action information processing module (1.2) and an action classification description module (1.3), which preprocess a text description file, a captured motion-capture c3d file and a classification description file respectively, and finally outputs a human motion dataset hdf5 file (1.4);
the conditional generative adversarial network module (2) comprises a dataset sample retriever (2.1), a conditional generative adversarial network model (2.2) and a model trainer (2.3); the dataset sample retriever (2.1) retrieves a batch of sample data from the human motion dataset hdf5 file (1.4) as a batch of samples (2.1.6); the model trainer (2.3) receives the conditional generative adversarial network model (2.2) and the batch of samples (2.1.6) output by the dataset sample retriever (2.1), trains the conditional generative adversarial network model (2.2) inside the model trainer (2.3), and after training is finished the conditional generative adversarial network model (2.2) is used to generate the action information corresponding to a text;
the action planning module (3) comprises a generated-data post-processing model (3.1), a virtual humanoid robot model (3.2) and an action visualization interface (3.3); the generated-data post-processing model (3.1) fits the action information corresponding to the text into continuous joint action curves, which are input into the virtual humanoid robot model (3.2) and the action visualization interface (3.3) respectively; the virtual humanoid robot model (3.2) receives the continuous joint action curves and generates the robot action, which is input into the action visualization interface (3.3); the action visualization interface (3.3) displays the continuous joint action curves and the robot action to the user, and also feeds the action description text given by the user into the conditional generative adversarial network model (2.2).
2. The intelligent human motion planning system based on a conditional generative adversarial network according to claim 1, characterized in that: the sentence vector encoding module (1.1) encodes an input text description file into a fixed-length sentence vector representation through a pre-trained BERT sentence vector encoding model, representing the semantic information of the text; the action information processing module (1.2) preprocesses the input motion-capture c3d file to obtain the trajectory information of the limb end points and the angle information of the joint points of the human body model, combines the complete trajectory information and angle information, and splits them into x, y and z channels for storing the human motion data; the action classification description module (1.3) classifies the classification description files, which serve as the basis for distinguishing different actions.
3. The intelligent human motion planning system based on a conditional generative adversarial network according to claim 1, characterized in that: the human motion dataset hdf5 file (1.4) comprises action embedding information (1.4.1), action data information (1.4.2), an action file name (1.4.3), an action text description (1.4.4) and an action classification description (1.4.5).
4. The intelligent human motion planning system based on a conditional generative adversarial network according to claim 1, characterized in that: the data structure of the batch of samples (2.1.6) consists of correct action information (2.1.1), correct embedding information (2.1.2), wrong action information (2.1.3), wrong embedding information (2.1.4) and action text information (2.1.5).
5. The intelligent human action planning system based on a conditional generation countermeasure network of claim 1, wherein: the conditional generation countermeasure network model (2.2) comprises a discriminator (2.2.1) and a generator (2.2.2), the generator (2.2.2) generates text corresponding action information according to the input action description text, and the discriminator (2.2.1) is used for evaluating the authenticity of the text corresponding action information;
the discriminator (2.2.1) comprises a first discriminating module (2.2.1.1), a condition vector embedding module (2.2.1.2) and a second discriminating module (2.2.1.3); the input of the first judging module (2.2.1.1) is action information, and the first judging module is used for adding a data channel through 1 convolution layer of 3 multiplied by 3, a spectrum normalization layer, a LeakyReLU activating function and 5 residual error blocks (2.2.1.4) of the judging device and pre-learning action data characteristics before embedding a condition vector; the input of the condition vector embedding module (2.2.1.2) is the output of the first judging module (2.2.1.1) and the condition vector, the output of the first judging module (2.2.1.1) is convolved, and the condition vector is subjected to a full connection layer, a spectrum normalization layer and a LeakyReLU activation function, and finally the convolution result and the processed condition vector are output for splicing; the input of the second judging module (2.2.1.3) is the output of the conditional vector embedding module (2.2.1.2), and the input is finally mapped into a scalar through 7 residual blocks (2.2.1.4) of the discriminators, 1 global average convergence layer, 1 full-connection layer and 1 layer spectrum normalization, and the characteristic learning of the action information and the conditional vector is synthesized;
the generator (2.2.2) comprises an input encoding module (2.2.2.1), a first generation module (2.2.2.2), a second generation module (2.2.2.3) and a third generation module (2.2.2.4); the input encoding module (2.2.2.1) receives the action description text given by the user, obtains a one-dimensional sentence vector of a specified dimension through the BERT sentence vector coding model, and passes it through one fully connected layer, one batch normalization layer and one ReLU activation function to represent the semantic information of the text and produce the feature expression of the sentence vector; the feature expression of the sentence vector is concatenated with a random noise vector and input into the first generation module (2.2.2.2), which applies one fully connected layer, one batch normalization layer and one ReLU activation function and then interpolates the tensor to a specified size, producing the sentence vector and random noise feature expression; the second generation module (2.2.2.3) receives the sentence vector and random noise feature expression, generates the features of the action data by stacking several groups of generator residual blocks (2.2.2.5) and 2× upsampling layers, and then adjusts the result to the specified size through an action-size upsampling layer to obtain the feature expression of the action information; the third generation module (2.2.2.4) receives the feature expression of the action information and finally generates the action information corresponding to the text through one batch normalization layer, one ReLU activation function, one 3×3 convolution layer and one Tanh activation function.
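As a minimal PyTorch sketch (not part of the claims) of two of the sub-modules described in claim 5, the code below shows one way the condition vector embedding module (2.2.1.2) of the discriminator and the input encoding module (2.2.2.1) of the generator could be written; all layer sizes, channel counts and the LeakyReLU slope are assumptions.

```python
# Illustrative PyTorch sketch of two sub-modules from claim 5; sizes are assumptions.
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm


class ConditionVectorEmbedding(nn.Module):
    """Fuses the feature map of the first discrimination module with the condition vector."""

    def __init__(self, feat_channels=256, cond_dim=768, cond_channels=64):
        super().__init__()
        # convolve the output of the first discrimination module
        self.conv = spectral_norm(nn.Conv2d(feat_channels, feat_channels, 3, padding=1))
        # condition vector: fully connected layer + spectral normalization + LeakyReLU
        self.embed = nn.Sequential(
            spectral_norm(nn.Linear(cond_dim, cond_channels)),
            nn.LeakyReLU(0.2),
        )

    def forward(self, feat, cond):
        feat = self.conv(feat)
        c = self.embed(cond)                                  # (B, cond_channels)
        c = c[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        return torch.cat([feat, c], dim=1)                    # concatenate along channels


class InputEncoding(nn.Module):
    """Turns the BERT sentence vector into the feature expression of the sentence vector."""

    def __init__(self, sent_dim=768, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sent_dim, out_dim),
            nn.BatchNorm1d(out_dim),
            nn.ReLU(),
        )

    def forward(self, sentence_vec):
        return self.net(sentence_vec)
```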
6. The intelligent human action planning system based on a conditional generation countermeasure network of claim 5, wherein: the first discrimination module (2.2.1.1) and the second discrimination module (2.2.1.3) comprise discriminator residual blocks (2.2.1.4) of the same structure;
if the discriminator residual block (2.2.1.4) is the first residual block, the tensor input into the discriminator residual block (2.2.1.4) passes through two groups of a LeakyReLU activation function, a 3×3 convolution layer and spectral normalization, while the input tensor is copied directly; the two tensors are added to obtain a tensor with the same dimension as the former, and a LeakyReLU activation function is then applied to obtain the final output tensor; if the discriminator residual block (2.2.1.4) is not the first residual block, the tensor input into the discriminator residual block (2.2.1.4) passes through two groups of a LeakyReLU activation function, a 3×3 convolution layer and spectral normalization, while the input tensor is resized by a 1×1 convolution layer with spectral normalization; the two tensors are added to obtain a tensor with the same dimension as the former, and a LeakyReLU activation function is then applied to obtain the final output tensor.
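A minimal PyTorch sketch of the discriminator residual block (2.2.1.4) described in claim 6 follows; channel counts and the LeakyReLU slope are assumptions, and the first block assumes its input and output channel counts are equal (so the identity shortcut can be added).

```python
# Illustrative sketch of the discriminator residual block (2.2.1.4).
import torch.nn as nn
from torch.nn.utils import spectral_norm

class DiscriminatorResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, first=False):
        super().__init__()
        # two groups of LeakyReLU -> 3x3 convolution -> spectral normalization
        self.body = nn.Sequential(
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(in_ch, out_ch, 3, padding=1)),
            nn.LeakyReLU(0.2),
            spectral_norm(nn.Conv2d(out_ch, out_ch, 3, padding=1)),
        )
        # first block: copy the input directly; later blocks: 1x1 conv + spectral norm shortcut
        self.shortcut = (nn.Identity() if first
                         else spectral_norm(nn.Conv2d(in_ch, out_ch, 1)))
        self.out_act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.out_act(self.body(x) + self.shortcut(x))
```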
7. The intelligent human action planning system based on a conditional generation countermeasure network of claim 5, wherein: for the generator residual block (2.2.2.5) in the second generation module (2.2.2.3), if the generator residual block (2.2.2.5) is the first residual block, the tensor input into the generator residual block (2.2.2.5) passes through two groups of a 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor is copied directly; the two tensors are added to obtain a tensor with the same dimension as the former, and a ReLU activation function is then applied to obtain the final output tensor; if the generator residual block (2.2.2.5) is not the first residual block, the tensor input into the generator residual block (2.2.2.5) passes through two groups of a 3×3 convolution layer and batch normalization, with a ReLU activation function added before the second group, while the input tensor is resized by a 1×1 convolution layer with batch normalization; the two tensors are added to obtain a tensor with the same dimension as the former, and a ReLU activation function is then applied to obtain the final output tensor.
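Analogously, a minimal sketch of the generator residual block (2.2.2.5) described in claim 7, which differs from the discriminator block by using batch normalization instead of spectral normalization and ReLU instead of LeakyReLU; channel counts are assumptions, and the first block again assumes equal input and output channels.

```python
# Illustrative sketch of the generator residual block (2.2.2.5).
import torch.nn as nn

class GeneratorResBlock(nn.Module):
    def __init__(self, in_ch, out_ch, first=False):
        super().__init__()
        # two groups of 3x3 convolution + batch normalization, ReLU before the second group
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # first block: copy the input directly; later blocks: 1x1 conv + batch norm shortcut
        self.shortcut = (nn.Identity() if first
                         else nn.Sequential(nn.Conv2d(in_ch, out_ch, 1),
                                            nn.BatchNorm2d(out_ch)))
        self.out_act = nn.ReLU()

    def forward(self, x):
        return self.out_act(self.body(x) + self.shortcut(x))
```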
8. The intelligent human action planning system based on a conditional generation countermeasure network of claim 5, wherein: the objective function of the conditional generation countermeasure network model (2.2) is:

min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x|y)] + E_{z~P_z(z)}[log(1 - D(G(z|y)))]

wherein D represents the discriminator, G represents the generator, V represents the objective function, E represents the expectation, x~P_data(x) represents the motion data information (1.4.2) obtained from the human motion data set hdf5 file (1.4), and z~P_z(z) represents random noise sampled from a normal distribution; log D(x|y) represents the logarithm of the score that the discriminator D assigns to the motion data information x under the constraint of the condition vector y; log(1 - D(G(z|y))) represents the logarithm of the score that the discriminator D assigns, under the constraint of the condition vector y, to the action information generated by the generator G from the random noise z; max_D indicates that the discriminator D seeks to maximize the objective function, i.e. D(x|y) should increase and D(G(z|y)) should decrease; min_G indicates that the generator G seeks to minimize the objective function, i.e. D(G(z|y)) should increase; the discriminator D and the generator G thereby form an adversarial relationship.
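The following is a minimal sketch, not part of the claims, of how this objective could be estimated on one mini-batch, assuming the discriminator D outputs a probability in (0, 1); the function name and the epsilon constant are illustrative.

```python
# Illustrative Monte-Carlo estimate of the objective V(D, G) on one mini-batch.
# Assumes D(motion, cond) returns a probability in (0, 1); eps avoids log(0).
import torch

def value_function(D, G, real_x, cond_y, noise_z, eps=1e-8):
    real_term = torch.log(D(real_x, cond_y) + eps).mean()                    # E[log D(x|y)]
    fake_term = torch.log(1.0 - D(G(noise_z, cond_y), cond_y) + eps).mean()  # E[log(1 - D(G(z|y)))]
    return real_term + fake_term  # the discriminator maximizes this value, the generator minimizes it
```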
9. The intelligent human action planning system based on a conditional generation countermeasure network of claim 1, wherein: training the conditional generation countermeasure network model (2.2) in the model trainer (2.3) comprises acquiring sample data (2.3.1), training the discriminator with the generator fixed (2.3.2), and training the generator with the discriminator fixed (2.3.3);
acquiring sample data (2.3.1) obtains the correct action information (2.1.1), the correct embedding information (2.1.2) and the incorrect action information (2.1.3) from the batch sample (2.1.6), creates real labels and fake labels, and applies label smoothing to the real labels;
training the discriminator with the generator fixed (2.3.2) first inputs the correct action information (2.1.1) and the correct embedding information (2.1.2) into the discriminator (2.2.1) and calculates the loss real_loss of the real data, with the smoothed real labels as targets; then inputs random noise and the correct embedding information (2.1.2) into the generator (2.2.2), inputs the generated action information and the correct embedding information (2.1.2) into the discriminator (2.2.1), and calculates the loss fake_loss of the generated data, with the fake labels as targets; then inputs the incorrect action information (2.1.3) and the correct embedding information (2.1.2) into the discriminator (2.2.1) and calculates the loss wrong_loss of the incorrect data, with the fake labels as targets; and finally calculates the loss function of the discriminator;
training the generator with the discriminator fixed (2.3.3) first inputs random noise and the correct embedding information (2.1.2) into the generator (2.2.2), then inputs the generated action information and the correct embedding information (2.1.2) into the discriminator (2.2.1), then calculates the loss G_loss of the generated data, with the real labels as targets, and finally calculates the loss function of the generator.
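A minimal sketch of one training step as described in claim 9 is given below, reusing the BatchSample structure sketched under claim 4. The use of binary cross-entropy, the smoothing value 0.9 and the noise dimension are assumptions, and the discriminator D is assumed to output a probability in (0, 1).

```python
# Illustrative sketch of one training step (2.3.2 then 2.3.3); hyperparameters assumed.
import torch
import torch.nn.functional as F

def train_step(D, G, opt_D, opt_G, batch, noise_dim=128, smooth=0.9):
    b = batch.right_motion.size(0)
    real = torch.full((b, 1), smooth)      # smoothed real labels
    fake = torch.zeros(b, 1)               # fake labels

    # --- fix the generator, train the discriminator (2.3.2) ---
    z = torch.randn(b, noise_dim)
    real_loss = F.binary_cross_entropy(D(batch.right_motion, batch.right_embedding), real)
    fake_loss = F.binary_cross_entropy(
        D(G(z, batch.right_embedding).detach(), batch.right_embedding), fake)
    wrong_loss = F.binary_cross_entropy(D(batch.wrong_motion, batch.right_embedding), fake)
    d_loss = real_loss + fake_loss + wrong_loss
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # --- fix the discriminator, train the generator (2.3.3) ---
    z = torch.randn(b, noise_dim)
    g_loss = F.binary_cross_entropy(
        D(G(z, batch.right_embedding), batch.right_embedding), torch.ones(b, 1))
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```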
10. The intelligent human action planning system based on a conditional generation countermeasure network of claim 9, wherein: the loss function of the discriminator is calculated as follows:

D_loss = real_loss + fake_loss + wrong_loss

wherein D_loss is the overall loss function; real_loss is the loss on the real data and characterizes the discriminator's ability to recognize real data; fake_loss is the loss on the generated data and characterizes the discriminator's ability to recognize generated data; wrong_loss is the loss on the incorrect data and characterizes the discriminator's ability to recognize incorrect data;
the loss function of the generator is calculated as follows:

G_loss = -[y·log(D(G(z))) + (1-y)·log(1-D(G(z)))]

wherein y represents the real label and D(G(z)) represents the discriminator's judgement of the action information generated for the text.
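As a quick check using only the definitions above, substituting the real label y = 1 (the target used when training the generator in claim 9) shows that G_loss reduces to the standard non-saturating generator loss:

```latex
G\_loss = -\left[\, 1 \cdot \log D(G(z)) + (1-1)\cdot \log\bigl(1 - D(G(z))\bigr) \,\right] = -\log D(G(z))
```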
Publications (1)

Publication Number: CN116843878A, Publication Date: 2023-10-03
Family ID: 88161580
Country Status (1): CN

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination