CN110496377B - Virtual table tennis player ball hitting training method based on reinforcement learning - Google Patents

Virtual table tennis player ball hitting training method based on reinforcement learning

Info

Publication number
CN110496377B
CN110496377B (application number CN201910763946.6A)
Authority
CN
China
Prior art keywords
ball
table tennis
racket
dimensional
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910763946.6A
Other languages
Chinese (zh)
Other versions
CN110496377A (en)
Inventor
李桂清
曾繁忠
黎子聪
吴自辉
聂勇伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN201910763946.6A priority Critical patent/CN110496377B/en
Publication of CN110496377A publication Critical patent/CN110496377A/en
Application granted granted Critical
Publication of CN110496377B publication Critical patent/CN110496377B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63B APPARATUS FOR PHYSICAL TRAINING, GYMNASTICS, SWIMMING, CLIMBING, OR FENCING; BALL GAMES; TRAINING EQUIPMENT
    • A63B69/00 Training appliances or apparatus for special sports
    • A63B71/00 Games or sports accessories not covered in groups A63B1/00 - A63B69/00
    • A63B71/06 Indicating or scoring devices for games or players, or for other sports activities
    • A63B71/0619 Displays, user interfaces and indicating devices, specially adapted for sport equipment, e.g. display mounted on treadmills
    • A63B2071/065 Visualisation of specific exercise parameters
    • A63B2102/00 Application of clubs, bats, rackets or the like to the sporting activity; particular sports involving the use of balls and clubs, bats, rackets, or the like
    • A63B2102/16 Table tennis

Abstract

The invention discloses a virtual table tennis player ball hitting training method based on reinforcement learning, which comprises the following steps: 1) designing a task scene and a task flow; 2) training a ball hitting strategy for the racket using a reinforcement learning method; 3) estimating the motion of each joint of the human body during the stroke using an inverse kinematics algorithm; 4) training a movement strategy for the root node using reinforcement learning. By designing a simple reward function and without any training data, the invention obtains a virtual player that hits the ball with a reasonable posture and high accuracy, and no complex hitting rules need to be designed; at the same time, owing to the low computational cost of the forward pass of the reinforcement learning policy, the virtual player can hit the ball at a stably high frame rate, giving the user a good interactive experience.

Description

Virtual table tennis player ball hitting training method based on reinforcement learning
Technical Field
The invention relates to the field of virtual reality and reinforcement learning, in particular to a virtual table tennis player batting training method based on reinforcement learning.
Background
Virtual reality is one of the intensively researched subjects in the computer field. In recent years, with the advent and development of virtual reality equipment such as the HTC Vive and Oculus, virtual reality technology has reached a new height, and its applications are nearly unlimited; it is now widely used in fields such as the military, education and entertainment. As virtual reality equipment becomes cheaper and more widely available, people interact with virtual reality applications more often and more deeply, and their expectations of quality keep rising: users not only want the perceived virtual scene to be indistinguishable from a real one, but also want more freedom to interact with the virtual scene and to receive feedback that is as realistic as possible. Virtual characters can give the user a sense of immersion through the actions they take in a virtual scene; this sense of immersion and realism derives from the intelligence and reasonableness of those actions, and for virtual applications in which the characters are human, one would like the behavior decisions and actions of the virtual human to be as similar as possible to those of a real human.
Intelligent virtual characters, which can be defined as characters that act autonomously in a virtual environment or generate feedback in response to environmental changes, are one of the research subjects in the field of artificial intelligence. The core of an intelligent virtual character is its action policy: given the state of the environment the character is in, the policy outputs the action to be taken. There is much related work on designing action policies for intelligent virtual characters, and the mainstream approaches fall into two categories: rule-based methods and machine-learning-based methods. A rule-based method means that the action policy of the character is specified by hand, and includes policy design methods based on logic, state machines, strategy trees and the like. The main problem with rule-based methods is that rules become difficult to set when the problem grows more complex. For example, designing a virtual character that shoots inside a maze requires deciding how to move through a complex maze, when to shoot, and where to shoot; writing rules that let the character make such complex decisions requires very intricate character logic and also greatly increases the amount of computation. With the development of machine learning, machine learning methods can greatly simplify the design of complex virtual characters. Reinforcement learning is a branch of machine learning that performs well on tasks with well-defined objectives.
In the task of designing a virtual table tennis player, a rule-based method needs complex hand-designed hitting rules, which are difficult to design and costly to run, while methods based on imitation learning or supervised learning need training data to be collected, so the training cost is high.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide an effective, scientific and reasonable virtual table tennis player hitting training method based on reinforcement learning.
In order to achieve the purpose, the technical scheme provided by the invention is as follows: a virtual table tennis player hitting training method based on reinforcement learning comprises the following steps:
1) designing a task scene and a task flow;
2) training a batting strategy of the racket by using a reinforcement learning method;
3) estimating the motion condition of each joint when the human body hits the ball by using an algorithm of inverse kinematics;
4) the mobility policy of the root node is trained using reinforcement learning.
In step 1), designing the task scene means modeling a virtual table tennis player and building a virtual table tennis court in Unity3D, setting the size of the court, the position and size of the table tennis table, the height of the net, the size of the table tennis ball, the size of the racket collision bounding box, and the origin of the world coordinate system;
Designing the task flow specifically comprises: when the table tennis ball touches a wall or the floor, or the bat hits the ball onto the table surface or the net at end B of the table, the round ends and the stroke is deemed to have failed; when the bat hits the ball onto the table surface at end A of the table, the round ends and the stroke is deemed successful; when the round ends, the table tennis bat is reset to its initial position to wait for the next round to start.
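As a concrete illustration of this task flow, the following minimal Python sketch encodes the round-termination and success rules just described; the object and field names (event, ball_hit_wall, and so on) are assumptions for illustration and are not part of the patented Unity3D implementation.

```python
# Hedged sketch of the round logic described above; all names are hypothetical.

def round_result(event):
    """Return 'fail', 'success', or None while the round is still running."""
    # Ball touched a wall or the floor: round over, stroke failed.
    if event.ball_hit_wall or event.ball_hit_floor:
        return "fail"
    # Bat sent the ball onto the table surface at end B, or into the net: failed.
    if event.ball_landed_on == "table_end_B" or event.ball_hit_net:
        return "fail"
    # Bat sent the ball onto the table surface at end A: stroke succeeded.
    if event.ball_landed_on == "table_end_A":
        return "success"
    return None  # round continues

def on_round_end(bat):
    # Reset the bat to its initial position and wait for the next round.
    bat.reset_to_initial_position()
```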
In step 2), training a batting strategy by using a neural network based on a reinforcement learning method, and comprising the following steps:
2.1) design observations
The observation refers to the data collected by the virtual character in the virtual environment. The observation for the racket's ball hitting strategy training is set to 4 three-dimensional vectors {p_ball, v_ball, p_bat, r_bat}, where p_ball is the position of the table tennis ball, v_ball is the velocity of the table tennis ball, p_bat is the position of the table tennis bat, and r_bat is the rotation angle of the table tennis bat divided by 360 degrees. An observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the racket's ball hitting strategy training network;
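To make the observation layout concrete, the sketch below stacks the 4 three-dimensional vectors of the last 3 frames into the 36-dimensional network input mentioned in step 2.4); the helper names and NumPy usage are assumptions, not part of the patent.

```python
import numpy as np
from collections import deque

def frame_observation(ball_pos, ball_vel, bat_pos, bat_rot_deg):
    """One frame of observation: {p_ball, v_ball, p_bat, r_bat} = 12 floats."""
    r_bat = np.asarray(bat_rot_deg, dtype=np.float32) / 360.0  # rotation / 360 degrees
    return np.concatenate([ball_pos, ball_vel, bat_pos, r_bat]).astype(np.float32)

history = deque(maxlen=3)  # append one frame_observation(...) per frame

def network_input():
    """Concatenate the last 3 frames: 3 x 12 = 36-dimensional policy input."""
    assert len(history) == 3, "need 3 collected frames"
    return np.concatenate(list(history))
```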
2.2) design behavior
The behavior is estimated from the observation data. The behavior for the racket's ball hitting strategy training of the virtual character is a 9-dimensional vector {T_ballbat, R_ballbat, S, C, F}, where T_ballbat is the three-dimensional translation of the racket, each component of which is normalized to between 0 and 1 and multiplied by weight coefficients w_x, w_y and w_z that control the moving speed of the racket in the three directions, ensuring a reasonable racket motion speed; R_ballbat is the vector of rotation angles of the racket about the three coordinate axes, multiplied at actual output by weight coefficients w_u, w_v and w_w; S determines the moment of entering the stroke preparation action, and when S is greater than zero the racket enters the preparation action; C selects the stroke action: when C is greater than 0 the ball is hit with a forehand stroke, and when C is less than 0 the ball is hit with a backhand stroke; F determines the hitting force: when the bat collides with the table tennis ball, a force C_F + w_F·F is applied to the ball along the racket-face direction, where C_F is the basic hitting force and w_F is the weight of F; so that F can control the hitting force more precisely, w_F should not be too large; C_F = -0.4 + 0.2 × Z_d, where Z_d is the z-direction value of the desired drop point;
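The following sketch shows how one 9-dimensional behavior vector could be decoded into racket motion and hitting parameters as described above; the bat interface and parameter passing are assumptions, and the weight values are supplied by the caller (the embodiment later gives concrete values such as w_x = 0.25 and w_F = 1).

```python
import numpy as np

def apply_action(action, bat, w_t, w_r, w_f, c_f):
    """Decode the 9-dimensional behavior {T, R, S, C, F} for the racket (sketch).

    w_t = (w_x, w_y, w_z) and w_r = (w_u, w_v, w_w) are the weight coefficients;
    c_f is the basic hitting force, set in the text as c_f = -0.4 + 0.2 * Z_d.
    """
    t = np.asarray(action[0:3]) * np.asarray(w_t)   # scaled 3D translation
    r = np.asarray(action[3:6]) * np.asarray(w_r)   # scaled rotation angles
    s, c, f = action[6], action[7], action[8]
    bat.translate(t)
    bat.rotate(r)
    if s > 0:
        bat.enter_preparation()                     # S > 0: enter preparation action
    bat.stroke = "forehand" if c > 0 else "backhand"  # C selects the stroke type
    bat.hit_force = c_f + w_f * f                   # force applied along the racket face
```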
2.3) design reward function
The reward function for the ball hitting strategy training is set as follows:
R_bat = w_limit·R_limit + w_goal·R_goal + w_support·R_support
The reward function comprises three kinds of terms, namely constraint, goal and auxiliary terms;
R_limit is a function for constraining the behavior of the racket; it penalizes unreasonable situations that should not occur:
[formula image omitted]
R_positionlimit is a function of the racket's range of motion, used to limit the range of motion of the racket:
[formula image omitted]
R_actionlimit is a function of whether the racket has entered the preparation action, used to constrain whether the racket enters the preparation action:
[formula image omitted]
w_limit is the weight of R_limit;
R_goal is a goal-driven function, used to drive the character to complete the game objective:
[formula image omitted]
w_goal is the weight of R_goal;
R_support is an auxiliary function that drives the racket to complete the ball hitting task through a series of prior knowledge:
R_support = R_hit + R_angle + R_height + R_droppoint
R_hit is a function of the hitting behavior; it ensures that when the racket hits the ball, positive feedback is given regardless of whether the ball can be returned to the opponent's table, and that when the racket fails to hit the ball, negative feedback is given:
[formula image omitted]
R_angle is a function of the hitting angle of the table tennis ball, used to measure the hitting angle:
[formula image omitted]
where the two quantities involved are the projection onto the x-z plane of the racket-face normal vector at the moment the ball contacts the bat and the projection onto the x-z plane of the ball's instantaneous velocity at that moment;
R_height is a function of the hitting height and elevation angle of the table tennis ball, used to measure the hitting height and elevation angle:
[formula image omitted]
where h is the height at which the ball contacts the bat, and n_y is the projection onto the y-axis of the racket-face normal direction at the moment the ball contacts the bat;
R_droppoint is a function of the drop point of the table tennis ball, used to measure the drop point:
R_droppoint = 1 - |p_z - Z_g|
where p_z is the z-direction value of the ball's drop point and Z_g is a point in the middle-to-back area of the opponent's table surface; when the ball lands near Z_g it is unlikely to go out of bounds or into the net;
w_support is the weight of R_support;
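To show how these terms combine, here is a hedged sketch of R_bat; the individual term values are passed in already evaluated, since their piecewise definitions appear only as formula images in the original, and the default weights follow the embodiment described later.

```python
def batting_reward(terms, w_limit=100.0, w_goal=2.0, w_support=1.0):
    """Sketch of R_bat = w_limit*R_limit + w_goal*R_goal + w_support*R_support.

    `terms` is a dict of already-evaluated reward terms; treating R_limit as the
    sum of its position and preparation-action parts is an assumption.
    """
    r_limit = terms["position_limit"] + terms["action_limit"]   # R_limit (assumed sum)
    r_goal = terms["goal"]                                      # stroke success / failure
    r_support = (terms["hit"] + terms["angle"]
                 + terms["height"] + terms["droppoint"])        # R_support
    return w_limit * r_limit + w_goal * r_goal + w_support * r_support
```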
2.4) design network and training parameters
The input of the neural network is set to a 36-dimensional vector and the output to a 9-dimensional vector; the whole network comprises 4 hidden layers and 1 output layer, each hidden layer containing 512 neurons. After the network is built, it is trained with the proximal policy optimization (PPO) algorithm to obtain the ball hitting strategy.
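The patent specifies only the layer sizes and the PPO algorithm, not a framework; the sketch below therefore uses PyTorch as an assumed framework to build both policy networks described in this document (the 36-to-9 hitting network here and the 54-to-3 root movement network of step 4.4).

```python
import torch.nn as nn

def make_policy(in_dim, out_dim, hidden=512, n_hidden=4):
    """MLP with n_hidden hidden layers of `hidden` neurons plus one output layer."""
    layers, d = [], in_dim
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# Ball hitting strategy: 36-dim observation -> 9-dim behavior, 4 hidden layers.
batting_policy = make_policy(36, 9, n_hidden=4)
# Root node movement strategy (step 4.4): 54-dim observation -> 3-dim behavior, 3 hidden layers.
root_move_policy = make_policy(54, 3, n_hidden=3)
# Both networks are then optimized with a PPO implementation (training loop not shown).
```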
In step 3), the motion of each joint is estimated using an inverse kinematics algorithm, so that unreasonable stretching and twisting of the posture do not occur; this comprises the following steps:
3.1) simplifying the skeleton of the three-dimensional human body model using a full-body inverse kinematics algorithm, namely the Full Body Biped IK algorithm; the simplified skeleton has 14 joints, namely the crotch, the head, the left and right thighs, the left and right shanks, the left and right soles, the left and right upper arms, the left and right forearms and the left and right palms, and the left and right shoulders, left and right thighs, left and right feet and left and right hands each contain an effector;
3.2) binding the handle of the racket to the end effector of the right hand; when the racket moves, the Full Body Biped IK algorithm treats the right arm as a joint chain and solves the position of each joint point on the right-arm joint chain with the FABRIK algorithm, which solves the inverse kinematics problem iteratively;
3.3) using the Full Body Biped IK algorithm to adjust the positions of all body joint points correspondingly, within a small range, according to the change of the right arm, thereby solving for the positions of all joint points;
3.4) using the Full Body Biped IK algorithm to calculate the motion of each joint point of the three-dimensional human body model under the condition that the right hand holds the racket and the root node is not moved.
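For reference, the FABRIK solve used in step 3.2) on the right-arm joint chain can be sketched as the textbook backward/forward reaching iteration below; this is a generic single-chain version under the assumption of a fixed chain root, not the Full Body Biped IK implementation itself.

```python
import numpy as np

def fabrik(joints, target, tol=1e-3, max_iter=20):
    """Generic FABRIK solve for one joint chain (root fixed at joints[0])."""
    p = [np.asarray(j, dtype=float) for j in joints]
    d = [np.linalg.norm(p[i + 1] - p[i]) for i in range(len(p) - 1)]
    root, target = p[0].copy(), np.asarray(target, dtype=float)
    if np.linalg.norm(target - root) > sum(d):      # target out of reach:
        for i in range(len(d)):                     # stretch the chain toward it
            lam = d[i] / np.linalg.norm(target - p[i])
            p[i + 1] = (1 - lam) * p[i] + lam * target
        return p
    for _ in range(max_iter):
        if np.linalg.norm(p[-1] - target) < tol:
            break
        p[-1] = target                              # backward reaching pass
        for i in range(len(p) - 2, -1, -1):
            lam = d[i] / np.linalg.norm(p[i + 1] - p[i])
            p[i] = (1 - lam) * p[i + 1] + lam * p[i]
        p[0] = root                                 # forward reaching pass
        for i in range(len(p) - 1):
            lam = d[i] / np.linalg.norm(p[i + 1] - p[i])
            p[i + 1] = (1 - lam) * p[i] + lam * p[i + 1]
    return p
```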
In step 4), because inverse kinematics does not move the root node, the movement of the root node is controlled by a root node movement strategy based on reinforcement learning; combined with the inverse kinematics, this makes the overall human body posture more reasonable. It comprises the following steps:
4.1) design observation
The observation for the root node movement strategy training is set to 6 three-dimensional vectors {p_agent, r_agent, n_spine, p_ref, r_ref, v_ref}, where p_agent is the position of the human body model, r_agent is the orientation of the human body model, n_spine is the direction of the human spine, p_ref is the position of the racket, r_ref is the orientation of the racket, and v_ref is the instantaneous velocity of the racket. An observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the root node movement strategy training network;
4.2) design behavior
The behavior for the root node movement strategy training is set to a 3-dimensional vector {t_x, t_z, r_y}, where t_x and t_z represent the movement of the human body model along the x-axis and the z-axis respectively, and r_y is the rotation of the human body model about the y-axis; t_x, t_z and r_y are all automatically normalized into the interval [-1, 1] and multiplied by weight coefficients before output (the coefficients are given as formula images in the original and are not reproduced here);
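A minimal sketch of applying this 3-dimensional behavior to the root node follows; the root interface is hypothetical and the weight coefficients w_tx, w_tz, w_ry stand in for the image-only coefficients mentioned above.

```python
import numpy as np

def apply_root_action(action, root, w_tx, w_tz, w_ry):
    """Apply the root node behavior {t_x, t_z, r_y}, each component in [-1, 1]."""
    t_x, t_z, r_y = np.clip(np.asarray(action, dtype=float), -1.0, 1.0)
    root.translate(x=w_tx * t_x, z=w_tz * t_z)   # move along the x and z axes
    root.rotate_y(w_ry * r_y)                    # rotate about the y-axis
```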
4.3) design reward function
The reward function for the root node movement strategy training is set as follows:
R_move = w_plimit·R_plimit + w_leave·R_leave + w_pose·R_pose + w_deviation·R_deviation
R_plimit is a function of the human body model's range of motion, used to limit the range of motion of the human body model:
[formula image omitted]
w_plimit is the weight of R_plimit;
R_leave is a function of the distance between the end of the racket handle and the hand; by measuring this distance it prevents the racket from leaving the hand:
[formula image omitted]
where d is the distance between the racket handle and the palm, p_hand is the three-dimensional coordinate of the palm, and p_bat is the three-dimensional coordinate of the racket handle;
w_leave is the weight of R_leave;
R_pose is a function of the stroke posture, used to measure the reasonableness of the stroke posture:
[formula images omitted]
where R_forehand and R_backhand are the reward functions corresponding to the forehand stroke and the backhand stroke respectively; cos α represents the angle between the line connecting the racket-holding hand to the root node and the unit vector in the x direction of the local coordinate system of the three-dimensional human body model; p_hand is the three-dimensional world coordinate of the racket-holding hand, p_root is the three-dimensional world coordinate of the root node, and the x-direction unit vector is (1, 0, 0) in the local coordinate system of the SMPL human body model;
w_pose is the weight of R_pose;
R_deviation is a function of the offset of the human body model's spine; the reward is formulated according to this spinal offset:
[formula images omitted]
where cos β defines the offset of the human spine and is computed from the three-dimensional coordinates of the human body model's neck and root node under the current action and the three-dimensional coordinates of the neck and root node under the initial action;
w_deviation is the weight of R_deviation;
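Both R_pose and R_deviation reduce to the cosine of an angle between two vectors. The sketch below computes cos α (the racket-holding hand relative to the root node, measured against the local x axis) and cos β (the current versus initial neck-to-root direction); the vector directions chosen here, and the mapping from these cosines to reward values, are assumptions, since the exact formulas appear only as images in the original.

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cos_alpha(p_hand, p_root, local_x=(1.0, 0.0, 0.0)):
    """Cosine of the angle between the root-to-hand line and the model's local x axis."""
    return _cos(np.subtract(p_hand, p_root), np.asarray(local_x))

def cos_beta(neck_now, root_now, neck_init, root_init):
    """Spinal offset: angle between the current and initial neck-to-root directions."""
    return _cos(np.subtract(neck_now, root_now), np.subtract(neck_init, root_init))
```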
4.4) design network and training parameters
The input of the neural network is set to a 54-dimensional vector and the output to a 3-dimensional vector; the whole network comprises 3 hidden layers and 1 output layer, each hidden layer containing 512 neurons. After the network is built, it is trained with the proximal policy optimization (PPO) algorithm to obtain the root node movement strategy.
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. The virtual player designed by the invention completes the ball hitting task with reasonable stroke actions and a high success rate, rarely hits the ball into the net or onto its own table surface, and can accurately return an incoming table tennis ball to the opponent's table surface.
2. The invention does not need complex hand-designed hitting rules, so the design difficulty and running cost are low.
3. The invention obtains a virtual player that hits the ball with a reasonable posture and high accuracy by designing a simple reward function, without any training data and without collecting data in advance.
4. Owing to the low computational cost of the forward pass of the reinforcement learning policy, the virtual player designed by the invention can hit the ball at a stably high frame rate, giving the user a good interactive experience.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a task scenario design diagram of the present invention.
Fig. 3 is a schematic view of the racket being too far forward, causing it to leave the hand.
Fig. 4 is a schematic diagram of a reasonable position of the racket.
Fig. 5 is a schematic diagram of a hitting strategy training network according to the present invention.
FIG. 6 is a schematic view of the human body model penetrating the table when positioned too far forward.
Fig. 7 is a schematic view of a ball hitting sequence.
Detailed Description
The present invention will be further described with reference to the following specific examples.
In this embodiment, a virtual table tennis player is designed that can return a table tennis ball to the opponent's table surface with a reasonable ball hitting posture and a high success rate. The main flow is shown in fig. 1, and the virtual table tennis player ball hitting training method based on reinforcement learning includes the following steps:
1) designing task scene and task flow
Designing the task scene: the virtual table tennis player is modeled with the SMPL model, and a virtual table tennis court is built in Unity3D. As shown in (a) of FIG. 2, the court size is 8 m × 16 m with 4 m high walls on all 4 sides, and a table tennis table is placed in the middle of the court; as shown in (b) of FIG. 2, the table size is 2.74 m × 1.525 m × 0.76 m, the net height is 0.1525 m, the diameter of the table tennis ball is 0.04 m, the racket size is 0.158 m × 0.152 m, the racket collision bounding box is 0.16 m × 0.1 m × 0.18 m, and the origin of the world coordinate system is the center of the court.
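For readability, the scene dimensions of this embodiment can be gathered into one set of constants; the sketch below only restates the numbers given above (in meters), and the constant names are arbitrary.

```python
# Scene constants for this embodiment (meters; world origin at the court center).
COURT_SIZE    = (8.0, 16.0)          # court floor, 8 m x 16 m
WALL_HEIGHT   = 4.0                  # walls on all 4 sides
TABLE_SIZE    = (2.74, 1.525, 0.76)  # table length x width x height
NET_HEIGHT    = 0.1525
BALL_DIAMETER = 0.04
RACKET_SIZE   = (0.158, 0.152)
RACKET_BBOX   = (0.16, 0.10, 0.18)   # racket collision bounding box
```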
Designing the task flow: at the beginning of a round, the virtual server serves the ball from a random position at end A of the table toward end B at a randomly assigned speed. When the table tennis ball touches a wall or the floor, or the bat hits the ball onto the table surface or the net at end B, the round ends and the stroke is deemed to have failed. When the bat hits the ball onto the table surface at end A, the round ends and the stroke is deemed successful. When the round ends, the table tennis bat is reset to its initial position to wait for the next round to start.
2) Training a hitting strategy of the racket by using a reinforcement learning method, comprising the following steps of:
2.1) design observations
The observation refers to the data collected by the virtual character in the virtual environment. The observation for the racket's ball hitting strategy training is set to 4 three-dimensional vectors {p_ball, v_ball, p_bat, r_bat}, where p_ball is the position of the table tennis ball, v_ball is the velocity of the table tennis ball, p_bat is the position of the table tennis bat, and r_bat is the rotation angle of the table tennis bat divided by 360 degrees. An observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the racket's ball hitting strategy training network.
2.2) design behavior
The behavior is estimated from the observation data. The behavior for the racket's ball hitting strategy training of the virtual character is a 9-dimensional vector {T_ballbat, R_ballbat, S, C, F}, where T_ballbat is the three-dimensional translation of the racket, each component of which is normalized to between 0 and 1 and multiplied by coefficients w_x, w_y and w_z that control the moving speed of the racket in the three directions, ensuring a relatively reasonable racket motion speed; here w_x = 0.25, w_y = 0.07 and w_z = 0.07. R_ballbat is the vector of rotation angles of the racket about the three coordinate axes, multiplied at actual output by weight coefficients w_u, w_v and w_w; here w_u = 1.5, w_v = 2 and w_w = 0.5. S determines the moment of entering the stroke preparation action, and when S is greater than zero the racket enters the preparation action. C selects the stroke action: when C is greater than 0 the ball is hit with a forehand stroke, and when C is less than 0 with a backhand stroke. F determines the hitting force: when the bat collides with the table tennis ball, a force C_F + w_F·F is applied to the ball along the racket-face direction, where C_F is the basic hitting force and w_F is the weight of F; so that F can control the hitting force more precisely, w_F should not be too large. Here w_F = 1 and C_F = -0.4 + 0.2 × Z_d, where Z_d is the z-direction value of the desired drop point, taking Z_d = 0.8 m, i.e. C_F = 1.2.
2.3) design reward function
R_bat = w_limit·R_limit + w_goal·R_goal + w_support·R_support
The reward function comprises three items of content, namely constraint, target and auxiliary;
R_limit is used to constrain the behavior of the racket, penalizing unreasonable situations that should not occur:
[formula image omitted]
R_positionlimit is used to limit the range of motion of the racket:
[formula image omitted]
Since the racket is attached to the end of the arm, if the racket moves too far forward, beyond the reachable range of the arm, it will twist the arm and may even leave the hand, as shown in fig. 3. The position-limiting function is therefore used to restrict the range of motion of the racket. In the present invention the position is considered reasonable as long as the z-axis coordinate of the racket does not exceed -0.9 (the position z = -0.9 is shown in FIG. 4), so a penalty is triggered only when z > -0.9, i.e. when the racket crosses the red line in FIG. 4. R_positionlimit triggers one check per frame, and the penalties are settled together at the end of the round;
R_actionlimit is used to constrain whether the racket enters the preparation action:
[formula image omitted]
When the bat collides with the table tennis ball, one R_actionlimit check is triggered; if the racket has not entered the preparation action at the moment of collision, a penalty is given;
w_limit is the weight of R_limit, set to 100;
R_goal is a goal-driven function, used to drive the character to complete the game objective:
[formula image omitted]
According to the rules of table tennis, the return is considered failed when the player does not hit the ball, hits it into the net, hits it onto his own half of the table, hits it out of bounds, or when the ball bounces on his own half 0 times or 2 or more times before being struck; the return is considered successful when the ball has bounced exactly once on the player's own half and is then hit within the bounds of the opponent's table surface. R_goal is evaluated once at the end of the round;
w_goal is the weight of R_goal, set to 2;
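The success and failure conditions above amount to a per-round rule check; a hedged sketch follows, where the round_info fields are hypothetical and the +1/-1 magnitudes are placeholders for the values given only as a formula image.

```python
def return_succeeded(round_info):
    """True only if the ball bounced exactly once on the player's own half
    and then landed within the bounds of the opponent's table surface."""
    return (round_info.own_side_bounces == 1
            and round_info.landed_on == "opponent_table")

def r_goal(round_info, reward_success=1.0, penalty_fail=-1.0):
    # Placeholder magnitudes; evaluated once at the end of the round.
    return reward_success if return_succeeded(round_info) else penalty_fail
```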
R_support is an auxiliary function that drives the racket to complete the ball hitting task through a series of prior knowledge:
R_support = R_hit + R_angle + R_height + R_droppoint
R_hit ensures that when the racket hits the ball, positive feedback is given regardless of whether the ball can be returned to the opponent's table, and that when the racket fails to hit the ball, negative feedback is given:
[formula image omitted]
R_angle is used to measure the hitting angle of the table tennis ball:
[formula image omitted]
where the two quantities involved are the projection onto the x-z plane of the racket-face normal vector at the moment the ball contacts the bat and the projection onto the x-z plane of the ball's instantaneous velocity at that moment;
R_height is used to measure the hitting height and elevation angle of the table tennis ball:
[formula image omitted]
where h is the height at which the ball contacts the bat, and n_y is the projection onto the y-axis of the racket-face normal direction at the moment the ball contacts the bat;
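Both auxiliary terms above are built from simple geometric quantities at the moment of contact. The sketch below computes the x-z-plane angle between the racket-face normal and the ball velocity (the quantity behind R_angle) and the contact height together with the y-component of the face normal (the quantities behind R_height); how these map to the actual reward values is given only as images in the original.

```python
import numpy as np

def _project_xz(v):
    """Project a 3D vector onto the x-z plane (drop the y component)."""
    return np.array([v[0], 0.0, v[2]], dtype=float)

def angle_measure(face_normal, ball_velocity):
    """Cosine of the angle between the projected face normal and ball velocity."""
    n, v = _project_xz(face_normal), _project_xz(ball_velocity)
    return float(np.dot(n, v) / (np.linalg.norm(n) * np.linalg.norm(v)))

def height_measure(contact_point, face_normal):
    """Contact height h and the y-projection n_y of the face normal at contact."""
    return float(contact_point[1]), float(face_normal[1])
```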
R_droppoint is used to measure the drop point of the table tennis ball:
R_droppoint = 1 - |p_z - Z_g|
where p_z is the z-direction value of the ball's drop point and Z_g is a point in the middle-to-back area of the opponent's table surface, set to 0.8;
w_support is the weight of R_support, set to 1.
2.4) design network and training parameters
The input of the neural network is set to a 36-dimensional vector and the output to a 9-dimensional vector; the whole network comprises 4 hidden layers and 1 output layer, each hidden layer containing 512 neurons, as shown in fig. 5. After the network is built, it is trained with the proximal policy optimization (PPO) algorithm to obtain the ball hitting strategy.
3) Estimating the motion of each joint of the human body during the stroke using an inverse kinematics algorithm, comprising the following steps:
3.1) simplifying the skeleton of the three-dimensional human body model using the full-body inverse kinematics (Full Body Biped IK) algorithm; the simplified skeleton has 14 joints, namely the crotch, the head, the left and right thighs, the left and right shanks, the left and right soles, the left and right upper arms, the left and right forearms and the left and right palms, and the left and right shoulders, left and right thighs, left and right feet and left and right hands each contain an effector.
3.2) binding the handle of the racket to the end effector of the right hand; when the racket moves, the Full Body Biped IK algorithm treats the right arm as a joint chain and solves the position of each joint point on the right-arm joint chain with the FABRIK algorithm, which solves the inverse kinematics problem iteratively.
3.3) using the Full Body Biped IK algorithm to adjust the positions of all body joint points correspondingly, within a small range, according to the change of the right arm, thereby solving for the positions of all joint points.
3.4) using the Full Body Biped IK algorithm to calculate the motion of each joint point of the three-dimensional human body model under the condition that the right hand holds the racket and the root node is not moved.
4) Training a mobility strategy of a root node by using reinforcement learning, comprising the following steps:
4.1) design Observation
The observation for the root node movement strategy training is set to 6 three-dimensional vectors {p_agent, r_agent, n_spine, p_ref, r_ref, v_ref}, where p_agent is the position of the human body model, r_agent is the orientation of the human body model, n_spine is the direction of the human spine, p_ref is the position of the racket, r_ref is the orientation of the racket, and v_ref is the instantaneous velocity of the racket. An observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the root node movement strategy training network.
4.2) design behavior
The behavior for the root node movement strategy training is set to a 3-dimensional vector {t_x, t_z, r_y}, where t_x and t_z represent the movement of the human body model along the x-axis and the z-axis respectively, and r_y is the rotation of the human body model about the y-axis; t_x, t_z and r_y are all automatically normalized into the interval [-1, 1] and multiplied by weight coefficients before output; the specific coefficient values used in this embodiment are given as formula images in the original and are not reproduced here.
4.3) design reward function
R_move = w_plimit·R_plimit + w_leave·R_leave + w_pose·R_pose + w_deviation·R_deviation
R_plimit is used to limit the range of motion of the human body model:
[formula image omitted]
When the human body model is positioned too far forward it may collide with, or even penetrate, the table tennis table, as shown in fig. 6. The edge of the table is at z = -1.5, so when the z-coordinate of the human body is greater than -1.5 the model receives a large penalty; R_plimit is checked once per frame;
w_plimit is the weight of R_plimit, set to 100;
R_leave prevents the racket from leaving the hand by measuring the distance between the end of the racket handle and the hand:
[formula image omitted]
where d is the distance between the racket handle and the palm, p_hand is the three-dimensional coordinate of the palm, and p_bat is the three-dimensional coordinate of the racket handle;
w_leave is the weight of R_leave, set to 10;
R_pose measures the reasonableness of the stroke posture:
[formula images omitted]
where R_forehand and R_backhand are the reward functions corresponding to the forehand stroke and the backhand stroke respectively; cos α represents the angle between the line connecting the racket-holding hand to the root node and the unit vector in the x direction of the local coordinate system of the three-dimensional human body model; when the racket-holding hand is on the right side of the body, cos α is greater than 0, and when it is on the left side of the body, cos α is less than 0; p_hand is the three-dimensional world coordinate of the racket-holding hand, p_root is the three-dimensional world coordinate of the root node, and the x-direction unit vector is (1, 0, 0) in the local coordinate system of the human body model;
w_pose is the weight of R_pose, set to 1;
R_deviation formulates a reward according to the offset of the human body model's spine:
[formula images omitted]
where cos β defines the offset of the human spine and is computed from the three-dimensional coordinates of the human body model's neck and root node under the current action and the three-dimensional coordinates of the neck and root node under the initial action;
w_deviation is the weight of R_deviation, set to 1.
4.4) design network and training parameters
The input of the neural network is set to a 54-dimensional vector and the output to a 3-dimensional vector; the whole network comprises 3 hidden layers and 1 output layer, each hidden layer containing 512 neurons. After the network is built, it is trained with the proximal policy optimization (PPO) algorithm to obtain the root node movement strategy.
Experiments show that the method is feasible: when facing an incoming table tennis ball, the racket, following the trained strategy, hits the ball to within the bounds of the opponent's table surface along a reasonable motion trajectory. Meanwhile, the moving racket is bound to the right palm of the human body model, the positions of all joints of the human body are solved by inverse kinematics, and the root node is then moved according to the root node movement strategy obtained by reinforcement learning training, so that the whole stroke is completed. FIG. 7 shows the action sequences of two strokes (a forehand stroke and a backhand stroke) in which the bat hits the table tennis ball and completes the return: line 1 of (a) in FIG. 7 shows the bat sequence of the forehand stroke, and line 1 of (b) in FIG. 7 shows the bat sequence of the backhand stroke. Then the motion of each joint point of the human body is solved using inverse kinematics: line 2 of (a) in FIG. 7 shows the action sequence of the virtual player hitting a forehand in the static state, and line 2 of (b) in FIG. 7 shows the action sequence of the virtual player hitting a backhand in the static state. Finally, the reinforcement-learning-based root node movement algorithm moves the root node so that the virtual player keeps a reasonable stroke action while moving: line 3 of (a) in FIG. 7 shows the action sequence of the virtual player hitting a forehand while moving, and line 3 of (b) in FIG. 7 shows the action sequence of the virtual player hitting a backhand while moving.
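Putting the three trained components together, the per-frame control loop of this embodiment can be sketched as follows; every object and method name here is an assumption used only to show the order in which the ball hitting strategy, the inverse kinematics solve and the root node movement strategy are applied.

```python
def control_step(env, batting_policy, root_policy, ik_solver, model, bat):
    # 1) Racket ball hitting strategy: 36-dim observation -> 9-dim racket behavior.
    bat_action = batting_policy(env.batting_observation())      # trained with PPO
    env.apply_bat_action(bat_action)

    # 2) Inverse kinematics: with the racket bound to the right palm, solve the
    #    joint positions of the human body model while the root node stays fixed.
    ik_solver.solve(model, right_hand_target=bat.handle_position())

    # 3) Root node movement strategy: 54-dim observation -> 3-dim root behavior,
    #    moving and rotating the root so the overall posture stays reasonable.
    root_action = root_policy(env.root_observation())
    env.apply_root_action(root_action)
```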
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereto; changes made according to the shape and principle of the present invention should therefore be covered within the protection scope of the present invention.

Claims (4)

1. A virtual table tennis player hitting training method based on reinforcement learning is characterized by comprising the following steps:
1) designing a task scene and a task flow;
2) the method based on reinforcement learning, which uses a neural network to train the batting strategy, comprises the following steps:
2.1) design observations
The observation refers to the data collected by the virtual character in the virtual environment. The observation for the racket's ball hitting strategy training is set to 4 three-dimensional vectors {p_ball, v_ball, p_bat, r_bat}, where p_ball is the position of the table tennis ball, v_ball is the velocity of the table tennis ball, p_bat is the position of the table tennis bat, and r_bat is the rotation angle of the table tennis bat divided by 360 degrees. An observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the racket's ball hitting strategy training network;
2.2) design behavior
The behavior is estimated from the observation data. The behavior for the racket's ball hitting strategy training of the virtual character is a 9-dimensional vector {T_ballbat, R_ballbat, S, C, F}, where T_ballbat is the three-dimensional translation of the racket, each component of which is normalized to between 0 and 1 and multiplied by weight coefficients w_x, w_y and w_z that control the moving speed of the racket in the three directions, ensuring a reasonable racket motion speed; R_ballbat is the vector of rotation angles of the racket about the three coordinate axes, multiplied at actual output by weight coefficients w_u, w_v and w_w; S determines the moment of entering the stroke preparation action, and when S is greater than zero the racket enters the preparation action; C selects the stroke action: when C is greater than 0 the ball is hit with a forehand stroke, and when C is less than 0 the ball is hit with a backhand stroke; F determines the hitting force: when the bat collides with the table tennis ball, a force C_F + w_F·F is applied to the ball along the racket-face direction, where C_F is the basic hitting force and w_F is the weight of F, set as C_F = -0.4 + 0.2 × Z_d, where Z_d is the z-direction value of the desired drop point;
2.3) design reward function
The reward function for the ball hitting strategy training is set as follows:
R_bat = w_limit·R_limit + w_goal·R_goal + w_support·R_support
The reward function comprises three kinds of terms, namely constraint, goal and auxiliary terms;
R_limit is a function for constraining the behavior of the racket; it penalizes unreasonable situations that should not occur:
[formula image omitted]
R_positionlimit is a function of the racket's range of motion, used to limit the range of motion of the racket:
[formula image omitted]
R_actionlimit is a function of whether the racket has entered the preparation action, used to constrain whether the racket enters the preparation action:
[formula image omitted]
w_limit is the weight of R_limit;
R_goal is a goal-driven function, used to drive the character to complete the game objective:
[formula image omitted]
w_goal is the weight of R_goal;
R_support is an auxiliary function that drives the racket to complete the ball hitting task through a series of prior knowledge:
R_support = R_hit + R_angle + R_height + R_droppoint
R_hit is a function of the hitting behavior; it ensures that when the racket hits the ball, positive feedback is given regardless of whether the ball can be returned to the opponent's table, and that when the racket fails to hit the ball, negative feedback is given:
[formula image omitted]
R_angle is a function of the hitting angle of the table tennis ball, used to measure the hitting angle:
[formula image omitted]
where the two quantities involved are the projection onto the x-z plane of the racket-face normal vector at the moment the ball contacts the bat and the projection onto the x-z plane of the ball's instantaneous velocity at that moment;
R_height is a function of the hitting height and elevation angle of the table tennis ball, used to measure the hitting height and elevation angle:
[formula image omitted]
where h is the height at which the ball contacts the bat, and n_y is the projection onto the y-axis of the racket-face normal direction at the moment the ball contacts the bat;
R_droppoint is a function of the drop point of the table tennis ball, used to measure the drop point:
R_droppoint = 1 - |p_z - Z_g|
where p_z is the z-direction value of the ball's drop point and Z_g is a point in the middle-to-back area of the opponent's table surface; when the ball lands near Z_g it is unlikely to go out of bounds or into the net;
w_support is the weight of R_support;
2.4) design network and training parameters
Setting the input of the neural network as a 36-dimensional vector and the output as a 9-dimensional vector, wherein the whole network comprises 4 hidden layers and 1 output layer, each hidden layer containing 512 neurons; after the network is established, training the neural network with the proximal policy optimization (PPO) algorithm to obtain the ball hitting strategy;
3) estimating the motion condition of each joint when the human body hits the ball by using an algorithm of inverse kinematics;
4) the mobility policy of the root node is trained using reinforcement learning.
2. The virtual table tennis player ball-hitting training method based on reinforcement learning of claim 1, wherein: in step 1), designing the task scene means modeling a virtual table tennis player and building a virtual table tennis court in Unity3D, setting the size of the court, the position and size of the table tennis table, the height of the net, the size of the table tennis ball, the size of the racket collision bounding box and the origin of the world coordinate system;
designing the task flow specifically comprises: when the table tennis ball touches a wall or the floor, or the bat hits the ball onto the table surface or the net at end B of the table, the round ends and the stroke is deemed to have failed; when the bat hits the ball onto the table surface at end A of the table, the round ends and the stroke is deemed successful; when the round ends, the table tennis bat is reset to its initial position to wait for the next round to start.
3. The virtual table tennis player ball-hitting training method based on reinforcement learning of claim 1, wherein: in step 3), the motion of each joint is estimated using an inverse kinematics algorithm, so that unreasonable stretching and twisting of the posture do not occur, comprising the following steps:
3.1) simplifying the skeleton of the three-dimensional human body model using a full-body inverse kinematics algorithm, namely the Full Body Biped IK algorithm; the simplified skeleton has 14 joints, namely the crotch, the head, the left and right thighs, the left and right shanks, the left and right soles, the left and right upper arms, the left and right forearms and the left and right palms, and the left and right shoulders, left and right thighs, left and right feet and left and right hands each contain an effector;
3.2) binding the handle of the racket to the end effector of the right hand; when the racket moves, the Full Body Biped IK algorithm treats the right arm as a joint chain and solves the position of each joint point on the right-arm joint chain with the FABRIK algorithm, which solves the inverse kinematics problem iteratively;
3.3) using the Full Body Biped IK algorithm to adjust the positions of all body joint points correspondingly, within a small range, according to the change of the right arm, thereby solving for the positions of all joint points;
3.4) using the Full Body Biped IK algorithm to calculate the motion of each joint point of the three-dimensional human body model under the condition that the right hand holds the racket and the root node is not moved.
4. The virtual table tennis player ball-hitting training method based on reinforcement learning of claim 1, wherein: in step 4), because inverse kinematics does not move the root node, the movement of the root node is controlled by a root node movement strategy based on reinforcement learning, which, combined with the inverse kinematics, makes the overall human body posture more reasonable, comprising the following steps:
4.1) design observation
The observation for the root node movement strategy training is set to 6 three-dimensional vectors {p_agent, r_agent, n_spine, p_ref, r_ref, v_ref}, where p_agent is the position of the human body model, r_agent is the orientation of the human body model, n_spine is the direction of the human spine, p_ref is the position of the racket, r_ref is the orientation of the racket, and v_ref is the instantaneous velocity of the racket; an observation is collected once per frame, and the observations collected over every 3 frames are used as the input of the root node movement strategy training network;
4.2) design behavior
The behavior for the root node movement strategy training is set to a 3-dimensional vector {t_x, t_z, r_y}, where t_x and t_z represent the movement of the human body model along the x-axis and the z-axis respectively, and r_y is the rotation of the human body model about the y-axis; t_x, t_z and r_y are all automatically normalized into the interval [-1, 1] and multiplied by weight coefficients before output (the coefficients are given as formula images in the original and are not reproduced here);
4.3) designing a reward function, wherein the reward function for the root node movement strategy training is set as follows:
R_move = w_plimit·R_plimit + w_leave·R_leave + w_pose·R_pose + w_deviation·R_deviation
R_plimit is a function of the human body model's range of motion, used to limit the range of motion of the human body model:
[formula image omitted]
w_plimit is the weight of R_plimit;
R_leave is a function of the distance between the end of the racket handle and the hand; by measuring this distance it prevents the racket from leaving the hand:
[formula image omitted]
where d is the distance between the racket handle and the palm, p_hand is the three-dimensional coordinate of the palm, and p_bat is the three-dimensional coordinate of the racket handle;
w_leave is the weight of R_leave;
R_pose is a function of the stroke posture, used to measure the reasonableness of the stroke posture:
[formula images omitted]
where R_forehand and R_backhand are the reward functions corresponding to the forehand stroke and the backhand stroke respectively; cos α represents the angle between the line connecting the racket-holding hand to the root node and the unit vector in the x direction of the local coordinate system of the three-dimensional human body model; p_hand is the three-dimensional world coordinate of the racket-holding hand, p_root is the three-dimensional world coordinate of the root node, and the x-direction unit vector is (1, 0, 0) in the local coordinate system of the SMPL human body model;
w_pose is the weight of R_pose;
R_deviation is a function of the offset of the human body model's spine; the reward is formulated according to this spinal offset:
[formula images omitted]
where cos β defines the offset of the human spine and is computed from the three-dimensional coordinates of the human body model's neck and root node under the current action and the three-dimensional coordinates of the neck and root node under the initial action;
w_deviation is the weight of R_deviation;
4.4) design network and training parameters
Setting the input of the neural network as a 54-dimensional vector and the output as a 3-dimensional vector, wherein the whole network comprises 3 hidden layers and 1 output layer, each hidden layer containing 512 neurons; after the network is established, training the neural network with the proximal policy optimization (PPO) algorithm to obtain the root node movement strategy.
CN201910763946.6A 2019-08-19 2019-08-19 Virtual table tennis player ball hitting training method based on reinforcement learning Active CN110496377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910763946.6A CN110496377B (en) 2019-08-19 2019-08-19 Virtual table tennis player ball hitting training method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910763946.6A CN110496377B (en) 2019-08-19 2019-08-19 Virtual table tennis player ball hitting training method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN110496377A CN110496377A (en) 2019-11-26
CN110496377B true CN110496377B (en) 2020-07-28

Family

ID=68588315

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910763946.6A Active CN110496377B (en) 2019-08-19 2019-08-19 Virtual table tennis player ball hitting training method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN110496377B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111546332A (en) * 2020-04-23 2020-08-18 上海电机学院 Table tennis robot system based on embedded equipment and application
CN113312840B (en) * 2021-05-25 2023-02-17 广州深灵科技有限公司 Badminton playing method and system based on reinforcement learning
CN113625876B (en) * 2021-08-10 2024-04-02 浙江大学 Immersion-based badminton tactic analysis method
CN114417618A (en) * 2022-01-21 2022-04-29 北京理工大学 Virtual reality assisted assembly complexity evaluation system
CN114841362A (en) * 2022-03-30 2022-08-02 山东大学 Method for collecting imitation learning data by using virtual reality technology
CN114609918B (en) * 2022-05-12 2022-08-02 齐鲁工业大学 Four-footed robot motion control method, system, storage medium and equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549237A (en) * 2018-05-16 2018-09-18 华南理工大学 Preview based on depth enhancing study controls humanoid robot gait's planing method
CN108983804A (en) * 2018-08-27 2018-12-11 燕山大学 A kind of biped robot's gait planning method based on deeply study
CN109345614A (en) * 2018-09-20 2019-02-15 山东师范大学 The animation simulation method of AR augmented reality large-size screen monitors interaction based on deeply study
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tumbling Target motion planning method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5750657B2 (en) * 2011-03-30 2015-07-22 株式会社国際電気通信基礎技術研究所 Reinforcement learning device, control device, and reinforcement learning method

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109540151A (en) * 2018-03-25 2019-03-29 哈尔滨工程大学 A kind of AUV three-dimensional path planning method based on intensified learning
CN108549237A (en) * 2018-05-16 2018-09-18 华南理工大学 Preview based on depth enhancing study controls humanoid robot gait's planing method
CN108983804A (en) * 2018-08-27 2018-12-11 燕山大学 A kind of biped robot's gait planning method based on deeply study
CN109345614A (en) * 2018-09-20 2019-02-15 山东师范大学 The animation simulation method of AR augmented reality large-size screen monitors interaction based on deeply study
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tumbling Target motion planning method

Also Published As

Publication number Publication date
CN110496377A (en) 2019-11-26

Similar Documents

Publication Publication Date Title
CN110496377B (en) Virtual table tennis player ball hitting training method based on reinforcement learning
CN111260762B (en) Animation implementation method and device, electronic equipment and storage medium
Chen et al. A system for general in-hand object re-orientation
Juliani et al. Unity: A general platform for intelligent agents
WO2021143289A1 (en) Animation processing method and apparatus, and computer storage medium and electronic device
CN111223170B (en) Animation generation method and device, electronic equipment and storage medium
Yang et al. Ball motion control in the table tennis robot system using time-series deep reinforcement learning
Zhu et al. Towards high level skill learning: Learn to return table tennis ball using monte-carlo based policy gradient method
CN111283700A (en) Table tennis service robot, table tennis service method and computer-readable storage medium
Wang et al. Movevr: Enabling multiform force feedback in virtual reality using household cleaning robot
Schwab et al. Learning skills for small size league robocup
CN116362133A (en) Framework-based two-phase flow network method for predicting static deformation of cloth in target posture
Wang et al. RETRACTED ARTICLE: Optimization analysis of sport pattern driven by machine learning and multi-agent
Hecker Physics in computer games (title only)
CN114565050A (en) Game artificial intelligence action planning method and system
CN102446359A (en) Small ball sport processing method based on computer and system thereof
Bai et al. Wrighteagle and UT Austin villa: RoboCup 2011 simulation league champions
CN102004552A (en) Tracking point identification based method and system for increasing on-site sport experience of users
Zicong et al. Training a virtual tabletennis player based on reinforcement learning
Sasaki et al. Exemposer: Predicting poses of experts as examples for beginners in climbing using a neural network
US20120223953A1 (en) Kinematic Engine for Adaptive Locomotive Control in Computer Simulations
CN109200575A (en) The method and system for reinforcing the movement experience of user scene of view-based access control model identification
Ding et al. Goalseye: Learning high speed precision table tennis on a physical robot
Chen Research on the VR Technology in Basketball Training [J]
CN109821243A (en) A method of simulation reappears shooting process

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant