CN110516389B - Behavior control strategy learning method, device, equipment and storage medium

Info

Publication number
CN110516389B
Authority
CN
China
Prior art keywords
behavior
demonstration
target object
joint
behavior data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910820695.0A
Other languages
Chinese (zh)
Other versions
CN110516389A (en)
Inventor
孙明飞
石贝
付强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910820695.0A
Publication of CN110516389A
Application granted
Publication of CN110516389B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a learning method, apparatus, computer device and storage medium for a behavior control strategy. The method comprises: sampling, from a demonstration behavior data sequence, a demonstration behavior data segment comprising at least two pieces of demonstration behavior data; setting initial state information of each joint of a target object simulated in a physical simulator according to the demonstration behavior data segment, and determining acting force data for each joint of the target object using a neural network model to be trained; controlling the motion of each joint of the simulated target object in the physical simulator, so that the physical simulator simulates a simulated behavior data sequence of the target object based on set action behavior limiting features; determining an action behavior difference degree according to the demonstration behavior data and the simulated behavior data; and optimizing the neural network model based on the action behavior difference degree until an optimization target is reached. The scheme of the application helps the object of demonstration learning generate expanded action behaviors based on the demonstrated actions.

Description

Behavior control strategy learning method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for learning a behavior control policy.
Background
Demonstration learning is an autonomous learning technique that takes a demonstrated behavior as the learning target: an object that is to learn a skill is required to mimic the demonstrated behavior, so that it acquires the motor skill corresponding to that behavior. The object to be taught a skill differs across application areas. For example, in the field of games, it may be a character or an animal in the game; in the field of robot control, it may be a robot.
At present, in the demonstration learning process, a behavior control strategy can be learned from a plurality of demonstration examples through various machine learning algorithms, and then the object in the actual application environment can be subjected to behavior control based on the behavior control strategy, so that the object can obtain action behaviors corresponding to the demonstration examples.
However, in the existing demonstration learning process, if the object that is to learn a skill is expected to acquire a certain motor skill, action demonstration data corresponding to that motor skill must be obtained in advance; without the corresponding demonstration data, the motor skill cannot be provided to the object, making it costly for the object to acquire new skills. For example, if a game character is expected to have the motor skill of walking while carrying a box, demonstration data of a real person walking while carrying a box must be obtained in advance.
Disclosure of Invention
In view of this, the present application provides a method, an apparatus, a device and a storage medium for learning a behavior control policy, so that an object undergoing demonstration learning can learn action behaviors different from the demonstrated actions, thereby reducing the complexity of teaching the object behavioral skills.
To achieve the above object, in one aspect, the present application provides a method for learning a behavior control policy, including:
sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two pieces of demonstration behavior data in sequential order, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
setting initial state information of each joint of a simulated target object in a physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained, wherein the target object and the demonstration object have the same joint;
controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, wherein the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting a feature required to be met by the action behavior of the simulated target object;
determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
In another aspect, the present application further provides a behavior control policy learning apparatus, including:
the data sampling unit is used for sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two pieces of demonstration behavior data with a sequence, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
the model control unit is used for setting initial state information of each joint of a simulated target object in the physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a to-be-trained neural network model, wherein the target object and the demonstration object have the same joint;
the data simulation unit is used for controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting a feature required to be met by the action behavior of the simulated target object;
the difference comparison unit is used for determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and the training optimization unit is used for optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
In yet another aspect, the present application further provides a computer device, including:
a processor and a memory;
the processor is used for calling and executing the program stored in the memory;
the memory is configured to store the program, the program at least to:
sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two pieces of demonstration behavior data in sequential order, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
setting initial state information of each joint of a simulated target object in a physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained, wherein the target object and the demonstration object have the same joint;
controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, wherein the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting a feature required to be met by the action behavior of the simulated target object; determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
In yet another aspect, the present application further provides a storage medium having stored therein computer-executable instructions, which when loaded and executed by a processor, implement the learning method of behavior control policy according to any one of the above.
Through the above technical solution, the behavior control strategy required for demonstration learning in the present application is expressed by a neural network model, and the training of that strategy is completed through the cooperation of the neural network model and a physical simulator. In addition, during training, besides the demonstration behavior data, action behavior limiting features corresponding to the object whose behavioral skill is to be learned are set in the physical simulator; these features constrain the characteristics that the behavior of the target object simulated in the physical simulator must satisfy. As a result, the behavior control strategy expressed by the trained neural network model can make the target object generate action behaviors that are as similar as possible to the demonstration behavior data while conforming to the set action behavior limiting features. Therefore, when the trained neural network model is used to control the behavior learning of the target object, the target object can learn action behaviors that are similar to, but not identical with, those in the demonstration behavior data; in other words, similar action behaviors can be extended from the demonstration. A behavior control strategy for an action behavior can thus be obtained even when no demonstration behavior data exists for that exact behavior, and the target object can be controlled to learn action behaviors similar to but different from the demonstration, which reduces the complexity of demonstration learning.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1a is a schematic diagram of the joints of a demonstration object and their states in demonstration learning;
FIG. 1b is a schematic structural diagram of the joints of an object whose behavioral skill is to be learned in demonstration learning;
FIG. 2 is a schematic diagram of a computer device to which a behavior control strategy learning method of the present application is applied;
FIG. 3 is a flow chart illustrating an embodiment of a behavior control strategy learning method of the present application;
FIG. 4 is a flow diagram illustrating a further embodiment of a method for learning a behavior control strategy according to the present application;
FIG. 5 illustrates an architectural diagram of one implementation principle of the learning method of the behavior control strategy of the present application;
FIG. 6 is a flow chart illustrating an application scenario of the learning method for behavior control strategy according to the present application;
FIG. 7 is a schematic structural diagram of an embodiment of a learning apparatus for a behavior control strategy according to the present application.
Detailed Description
The solution of the present application is applicable to demonstration learning, which involves a demonstration object and an object that is to learn behavioral skills. The demonstration object demonstrates behaviors so as to generate the demonstration behavior data on which demonstration learning is based, while the object to be taught behavioral skills is the one that ultimately learns the corresponding skills from that data. For example, this object may be a robot, or a game object in a game.
For example, in the field of games, the object to be learned with skills may be a game character in the game. In this case, it is possible to obtain the demonstration behavior data from the motion (e.g., the motion such as walking, jumping, etc.) demonstrated by the real user, and to perform reinforcement learning on the game character in the game based on the demonstration behavior data so that the game character can have the skill of the demonstrated motion (e.g., the motion such as walking, jumping, etc.).
Currently, in the demonstration learning process, the demonstration behavior data is generally expressed as the states of the respective joints of the demonstration object, which may include the angle, the velocity (in each reference direction), and so on of each joint. The object to be taught the skill has the same joints as the demonstration object, with the same degrees of freedom.
Fig. 1a shows the various joints contained in a demonstration object and the states of those joints.
In fig. 1a, the demonstration object is a human body by way of example, containing joints such as the knee, elbow and wrist joints.
Meanwhile, the demonstration behavior data may reflect the state of each joint of the demonstration object in fig. 1a, such as the angle and velocity of each joint in three-dimensional space. For example, in a set three-dimensional space with mutually perpendicular X, Y and Z axes, the angle of each joint of the demonstration object relative to the three axes can be obtained from the demonstration behavior data.
Accordingly, in order to enable a target object of a skill to be learned to demonstrate learning of the corresponding skill based on the demonstration behavior data of the demonstration object, the target object of the skill to be learned should have the same joint as the demonstration object. Of course, the degrees of freedom of the respective joints are also the same. As shown in fig. 1b, it is a schematic structural diagram of a target object for performing demonstration learning based on the demonstration behavior data of the demonstration object shown in fig. 1 a. As can be seen from fig. 1b, the target object is also a human body, and the joints and the degrees of freedom of the target object are the same as those of the presentation object.
It is understood that fig. 1a and 1b are only examples of the demonstration object and the target object to be learned with the behavioral skill being a human, and in practical applications, if the target object is in other forms, it is required that the demonstration object has the same joint as the target object, for example, if the target object is a robot in an animal form (e.g., a robotic cat), the demonstration object may be an animal (e.g., a cat), and the like.
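For illustration only, the first state information described above might be organized as in the following Python sketch; the field names and layout are assumptions for this description, not a format specified by the application:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class JointState:
    # Orientation of the joint about the X, Y and Z axes (e.g. radians).
    angles: List[float]
    # Velocity of the joint along each reference direction.
    velocities: List[float]

# One frame of demonstration behavior data: first state information per joint.
DemoFrame = Dict[str, JointState]

frame: DemoFrame = {
    "knee_left": JointState(angles=[0.1, 0.0, 0.4], velocities=[0.0, 0.2, 0.0]),
    "elbow_right": JointState(angles=[0.7, 0.1, 0.0], velocities=[0.1, 0.0, 0.0]),
}
```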
It is understood that, in the demonstration learning process, it is necessary to determine a behavior control strategy for controlling the target object based on the demonstration behavior data and then control the movement of the target object based on the behavior control strategy so that the target object can learn behavior skills similar to the demonstration behavior.
However, the inventors have found through research that the behavior control strategy determined by existing demonstration learning can only make the target object learn behaviors essentially identical to the demonstrated behavior; it cannot make the object learn other, extended behaviors that are merely similar to the demonstration. The behavioral skills learnable through demonstration learning are therefore limited, and a behavior can be learned only when demonstration data for that behavior exists, which makes demonstration learning complex and inflexible.
Based on the research, the scheme of the application can train a behavior control strategy suitable for expanding demonstration behaviors based on the demonstration behavior data.
The scheme of the application is suitable for computer equipment which can be a personal computer, a server and other electronic equipment with data processing capability.
For example, referring to fig. 2, a schematic structural diagram of a computer device to which the learning method of the behavior control policy of the embodiment of the present application is applied is shown. In fig. 2, the computer device 200 may include: a processor 201, a memory 202, a communication interface 203, an input unit 204, and a display 205 and communication bus 206.
The processor 201, the memory 202, the communication interface 203, the input unit 204 and the display 205 are all communicated with each other through a communication bus 206.
In the embodiment of the present application, the processor 201 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA), or another programmable logic device.
The processor may call a program stored in the memory 202, and in particular, the processor may perform the operations performed by the computer device of fig. 3-6.
The memory 202 is used for storing one or more programs, which may include program codes including computer operation instructions, and in the embodiment of the present application, the memory stores at least the programs for implementing the following functions:
sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two pieces of demonstration behavior data in sequential order, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
setting initial state information of each joint of a simulated target object in a physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained, wherein the target object and the demonstration object have the same joint;
controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, wherein the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting a feature required to be met by the action behavior of the simulated target object; determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which the demonstration learning is based.
In one possible implementation, the memory 202 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, the above-mentioned programs, and application programs required by at least one function (such as a sound playing function, an image playing function, a positioning function, and the like); the data storage area may store data created during use of the computer device, such as audio data, a phone book, etc.
Further, the memory 202 may include a high-speed random access memory, and may also include a nonvolatile memory or the like.
The communication interface 203 may be an interface of a communication module, such as an interface of a GSM module.
The computer device may further include an input unit 204, which may include a touch sensing unit, a keyboard, and the like.
The display 205 includes a display panel, such as a touch display panel.
Of course, the structure shown in fig. 2 does not limit the computer device in the embodiment of the present application; in practical applications, the computer device may include more or fewer components than shown in fig. 2, or combine certain components.
The following describes a learning method of the behavior control strategy according to the present application with reference to a flowchart.
As shown in fig. 3, which shows a flowchart of a learning method of a behavior control strategy according to the present application, the solution of the present embodiment can be applied to the aforementioned computer device, and the method includes:
s301, sampling demonstration behavior data segments serving as training samples from the demonstration behavior data sequence.
The demonstration behavior data sequence comprises a plurality of consecutive pieces of demonstration behavior data at different time instants. The demonstration behavior data segment is a contiguous portion of that sequence; accordingly, it comprises at least two pieces of demonstration behavior data in sequential order, i.e., demonstration behavior data of adjacent moments. Each piece of demonstration behavior data includes state information for the various joints of the demonstration object.
The state information of the joints can represent the specific states presented by the joints, the motion states of the joints can be reflected through the state information, and further the action behaviors of the demonstration object are reflected through the state information of each joint. For example, the state information of the joint includes one or more of state values such as an angle and a speed of the joint. For the sake of convenience of distinguishing from the state information of each joint in the subsequent simulation, the state information of the joint of the presentation object is referred to as first state information.
It will be appreciated that the demonstration behavior data sequence can be obtained in a variety of ways. In one possible implementation, after the demonstration object demonstrates an action, a motion capture device captures the demonstration data, which can be used directly as the demonstration behavior data sequence, or processed to obtain it. Of course, demonstration behavior data sequences obtained in other ways are equally applicable to this embodiment, which is not limited in this respect.
It will be appreciated that sampling the demonstration behavior data sequence yields the samples required for training the behavior control strategy. The demonstration behavior data segment serving as a training sample can be obtained in various ways; for example, one segment may be randomly sampled from the sequence at a time. It is also possible to sample several segments from the sequence at once, with only one segment used to train the neural network in each training round.
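A minimal sketch of step S301 under the representation assumed above (`DemoFrame` and the segment length are illustrative, not prescribed by the application):

```python
import random
from typing import List

def sample_demo_segment(demo_sequence: List[DemoFrame],
                        segment_len: int = 32) -> List[DemoFrame]:
    """Randomly sample one contiguous demonstration behavior data segment
    of at least two frames from the demonstration behavior data sequence."""
    assert segment_len >= 2 and len(demo_sequence) >= segment_len
    start = random.randrange(len(demo_sequence) - segment_len + 1)
    return demo_sequence[start:start + segment_len]
```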
S302, setting initial state information of each joint of a simulated target object in the physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained.
The physical simulator, also called a physics engine, is a simulation program that simulates the motion of an agent.
In the embodiment of the present application, the agent simulated in the physical simulator is the target object, and the physical simulator can simulate the forces on, and the motion of, the target object as in a real physical space.
The target object is an object to be learned with behavioral skills, for example, taking a game application as an example, the target object may be a game object such as a game character in the game application. As can be seen from the foregoing, the target object has the same joints as the presentation object.
In the embodiment of the present application, the behavior control strategy is expressed by a neural network model, and therefore, the behavior control strategy for controlling each joint of the target object can be obtained by training the neural network model. The behavior control strategy may be characterized by forces output by the neural network model for the respective joints of the target object.
It can be understood that, in order for the target object simulated in the physical simulator to learn the demonstrated behavior corresponding to the demonstration behavior data, the initial state information of each joint of the target object must first be set in the physical simulator based on the demonstration behavior data in the segment, so that the initial action behavior of the simulated target object is consistent with the first (or an intermediate) action behavior of the demonstration object in the demonstration behavior data segment.
As an optional manner, in order to enable the physical simulator to simulate each demonstration behavior corresponding to the target object learning demonstration behavior data segment, initial state information of each joint of the target object simulated in the physical simulator may be set according to first state information of each joint of the demonstration object in the first demonstration behavior data in the demonstration behavior data segment. In this case, the state information of each joint of the target object in the physical simulator is consistent with the first state information of the corresponding joint of the presentation object contained in the first presentation behavior data in the presentation behavior data segment.
Correspondingly, the first state information of each joint of the demonstration object in the first piece of demonstration behavior data can be input into the neural network model to be trained, to obtain the acting force data output by the model for controlling each joint of the target object. In the present application, training of the neural network model is completed through interaction between the model and the physical simulator; the model must therefore predict, from the input demonstration behavior data, the acting forces each joint requires for the target object to learn the corresponding demonstrated behavior. Since the target object has the same joints as the demonstration object, the acting force data output by the model can be regarded equally as force data for each joint of the target object (or of the target object simulated in the physical simulator) and as force data corresponding to each joint of the demonstration object.
The acting force data of a joint may include one or more quantities describing the force applied to it, such as the magnitude, direction and duration of the control force acting on the joint.
The neural network model can be chosen as required; as one option, it may be a deep neural network model.
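As a sketch of what such a model could look like (PyTorch and the layer sizes are assumptions for illustration; the application does not fix a network architecture):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Maps flattened joint state information to acting force data,
    e.g. one control force value per actuated joint degree of freedom."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, joint_states: torch.Tensor) -> torch.Tensor:
        return self.net(joint_states)
```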
And S303, controlling the motion of each joint of the target object simulated in the physical simulator based on the acting force data of each joint determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence based on the set action behavior limiting characteristics.
The simulated behavior data sequence includes at least one simulated behavior data including second state information of each joint of the target object.
The state information of each joint of the target object simulated by the physical simulator can also reflect the simulated action behavior of the target object, such as the angle, the speed and other numerical values of each joint of the simulated target object. For the sake of convenience of distinction, the state information of the joint of the simulated target object is referred to as second state information.
It can be understood that, once the initial state information of each joint of the target object in the physical simulator has been determined, the acting force data output by the neural network model for each joint is input into the physical simulator. The simulator then simulates the forces applied to each joint of the target object and the resulting changes in its actions, yielding the simulated state information of each joint of the target object.
It will be appreciated that each time a force is applied to a respective joint of a target object simulated in a physical simulator, there will be a change in state information for the respective joint of the target object, thereby simulating a simulated behavior data for the target object.
The physical simulator may also interact with the neural network model continuously to simulate multiple pieces of simulated behavior data. For example, multiple interactions between the simulator and the model may be set according to the number of pieces of demonstration behavior data in the segment, or according to actual needs: the acting force data output by the model is updated using the simulated behavior data produced by the simulator, the updated forces are applied again to the simulated target object, and the process repeats. In this way a series of simulated behavior data is produced, yielding a simulated behavior data sequence containing at least one piece of simulated behavior data.
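The interaction between model and simulator might be sketched as follows; `simulator.set_state`, `simulator.step` and `encode` are hypothetical placeholders standing in for the physical simulator's interface and a state-flattening helper, not a real engine API:

```python
def rollout(policy, simulator, demo_segment):
    """Alternate between the neural network model (acting forces) and the
    physical simulator (resulting joint states) to produce the simulated
    behavior data sequence for one demonstration behavior data segment."""
    state = demo_segment[0]
    simulator.set_state(state)            # initial joint states (step S302)
    sim_sequence = []
    # One simulator step per remaining demonstration frame.
    for _ in range(len(demo_segment) - 1):
        forces = policy(encode(state))    # acting force data per joint
        state = simulator.step(forces)    # simulate under the limiting features
        sim_sequence.append(state)        # second state information per joint
    return sim_sequence
```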
In particular, the physical simulator of the present application is further provided with an action behavior definition feature, and the action behavior definition feature is used for defining the feature required to be met by the action behavior of the simulated target object. That is, the physical simulator is configured with action behavior requirements that are additionally satisfied for limiting the learning action behavior of the target object.
For example, the action behavior limiting feature may specify an object that the simulated target object must carry while performing the action behavior; for instance, the target object needs to carry a box.
For another example, the action behavior limiting feature may constrain the manner of the simulated target object's action behavior; for instance, the target object needs to keep varying its actions.
As another example, the action behavior limiting feature may require that the target object learn the action behavior while controlling the movement of a particular item.
It can be understood that, when the physical simulator is provided with an action behavior limiting feature, the interaction between the neural network model and the physical simulator drives the finally simulated action behavior of the target object to satisfy the following principle: on the premise that the simulated action behavior of the target object is as similar as possible to the action behavior corresponding to the demonstration behavior data of the demonstration object, the action behavior also conforms to the action behavior limiting feature.
By way of example, suppose the action behavior demonstrated by the demonstration object is walking, and the purpose of demonstration learning is for the target object to learn to walk while carrying an article. In this case, the action behavior limiting feature set in the physical simulator may be that the article moves along with the target object.
S304, determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data.
The action behavior difference degree reflects the overall difference between the first state information of each joint of the demonstration object and the simulated second state information of each joint of the target object. This overall difference is, in effect, the degree of difference between the simulated action behavior of the target object and the action behavior of the demonstration object.
The specific manner of determining the action behavior difference degree may be set as required. For example, each piece of simulated behavior data produced by the physical simulator is obtained by learning from the corresponding piece of demonstration behavior data in the demonstration behavior data segment, so corresponding pairs of simulated and demonstration behavior data are determined by their positions in the simulated behavior data sequence and in the demonstration behavior data segment, respectively. For each such pair, a state difference value for each joint can be computed from the second state information of that joint of the target object in the simulated behavior data and the first state information of the corresponding joint of the demonstration object in the demonstration behavior data, for example as the Euclidean distance between the two. The action behavior difference degree can then be determined as the average of all the state difference values.
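A sketch of this comparison, assuming each joint's state information has already been flattened into a numeric vector (the frame layout and order-based pairing are assumptions):

```python
import numpy as np

def action_behavior_difference(demo_frames, sim_frames) -> float:
    """Average per-joint Euclidean distance between first state information
    (demonstration object) and second state information (target object),
    with frames paired by their order in the two sequences."""
    diffs = []
    for demo, sim in zip(demo_frames, sim_frames):
        for joint in demo:                 # same joint set on both objects
            d = np.asarray(demo[joint], dtype=float)
            s = np.asarray(sim[joint], dtype=float)
            diffs.append(np.linalg.norm(d - s))
    return float(np.mean(diffs))
```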
S305, based on the action behavior difference degree, optimizing the behavior control strategy expressed by the neural network model until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to demonstration learning.
It can be understood that the action behavior difference degree reflects how far the simulated action behavior of the target object is from the action behavior demonstrated by the demonstration object; it can therefore serve as a parameter for optimizing the neural network model.
Optimizing the behavior control strategy expressed by the neural network model essentially means adjusting the internal parameters of the model so as to change the strategy it expresses.
As an optional mode, demonstration learning can be combined with a reinforcement learning algorithm. Correspondingly, an excitation signal can be determined from the action behavior difference degree according to the reinforcement learning algorithm, and the internal parameters of the neural network model adjusted according to the excitation signal.
It can be understood that the optimization target can be set as required. Reaching the optimization target indicates that, through the behavior control strategy output by the neural network model, the similarity between the action behavior of the target object simulated in the physical simulator and the demonstrated behavior of the demonstration object meets the requirement. For example, in an alternative implementation, the optimization target may be that the action behavior difference degree reaches a minimum, i.e., is smaller than every difference degree determined before the current time; it may also be that the change amplitude of the determined action behavior difference degree is smaller than a set value.
If, based on the currently determined action behavior difference degree, it is determined that the optimization target has not been reached, the behavior control strategy expressed by the neural network model is optimized based on that difference degree, and training continues with the sampled training samples. For example, if a plurality of demonstration behavior data segments were sampled in step S301, segments not yet used for training can be selected and the operations of steps S302 to S305 continued. Alternatively, if only one demonstration behavior data segment is sampled from the sequence as a training sample each time, the method may return to step S301 to sample a new segment and continue performing steps S302 to S305 until the optimization target is reached.
Accordingly, if it is determined that the optimization goal is reached, the learning (or training) may be ended, and the trained neural network model may be used as a behavior control strategy for the target object in the real scene.
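Tying steps S301 to S305 together, the outer training loop could be sketched as below; `update_policy` stands for whatever optimizer the reinforcement learning algorithm provides, and the stopping rule shown is the "change amplitude below a set value" variant of the optimization target:

```python
def train(policy, simulator, demo_sequence, eps=1e-3, max_iters=100_000):
    prev_diff = float("inf")
    for _ in range(max_iters):
        segment = sample_demo_segment(demo_sequence)              # S301
        sim_seq = rollout(policy, simulator, segment)             # S302/S303
        diff = action_behavior_difference(segment[1:], sim_seq)   # S304
        if abs(prev_diff - diff) < eps:    # optimization target reached
            break
        update_policy(policy, diff)        # S305: adjust internal parameters
        prev_diff = diff
    return policy
```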
As an alternative, after the neural network model is obtained through training, it may be loaded into a target application program, so that the action behavior of the target object controlled by that application is governed by the behavior control strategy the model expresses. The target application controls the running of the target object; that is, it is the program that controls the target object in the actual application scenario, rather than a program controlling a simulated target object in the simulation environment.
For example, for demonstration learning in the game field, after the neural network model is trained it may be loaded into a game application to control the action behavior of a game object based on the model: the current action behavior of the game object is input into the neural network model, and the motion of each joint of the game object is controlled based on the acting force data output by the model for each joint, so that the game object obtains action behaviors that are similar to the demonstration object's and conform to the action behavior limiting features.
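In the deployed application, the control loop reduces to inference; the game-side method names below (and the reused `encode` helper) are invented for illustration:

```python
def control_game_object(policy, game_object):
    """Drive a game object's joints with the trained behavior control strategy."""
    while game_object.is_active():
        state = game_object.current_joint_states()   # current action behavior
        forces = policy(encode(state))               # acting force per joint
        game_object.apply_joint_forces(forces)       # move each joint
```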
Through the above technical solution, the behavior control strategy required for demonstration learning in the present application is expressed by a neural network model, and the training of that strategy is completed through the cooperation of the neural network model and the physical simulator. In addition, during training, besides the demonstration behavior data, action behavior limiting features corresponding to the object whose behavioral skill is to be learned are set in the physical simulator; these features constrain the characteristics that the behavior of the target object simulated in the physical simulator must satisfy. As a result, the behavior control strategy expressed by the trained neural network model can make the target object generate action behaviors that are as similar as possible to the demonstration behavior data while conforming to the set action behavior limiting features.
Therefore, when the trained neural network model is used to control the behavior learning of the target object, the target object can learn action behaviors that are similar to, but not identical with, those in the demonstration behavior data; in other words, similar action behaviors can be extended from the demonstration. A behavior control strategy for an action behavior can thus be obtained even when no demonstration behavior data exists for that exact behavior, and the target object can be controlled to learn action behaviors similar to but different from the demonstration, which reduces the complexity of demonstration learning.
For ease of understanding, the solution of the present application is described below taking as an example a process of training the neural network model by deep reinforcement learning. In this case, deep reinforcement learning is combined with demonstration learning, and the action behavior limiting features are set according to the specific task requirements of the target object that is to learn the behavioral skill, so that training produces action behaviors that both follow the action form of the demonstration object and meet the specific requirements.
Referring to fig. 4, which shows a flowchart of another embodiment of the learning method of a behavior control policy according to the present application, the present embodiment is also applied to the aforementioned computer device, and the method of the present embodiment may include:
s401, randomly sampling a section of demonstration behavior data segment from the obtained demonstration behavior data sequence.
For example, the demonstration behavior data within a continuous time period is randomly extracted as the demonstration behavior data segment, which includes the demonstration behavior data of the demonstration object at no fewer than two consecutive time points. Each piece of demonstration behavior data likewise includes the first state information of each joint of the demonstration object.
It is understood that, the above step S401 is described by taking an example of sampling a demonstration behavior data segment as a training sample, but the present embodiment is also applicable to other cases.
S402, setting initial state information of each joint of the simulated target object in the physical simulator according to first state information of each joint of the demonstration object in first demonstration behavior data in the demonstration behavior data segment.
For example, the initial state information of each joint of the target object in the physical simulator is kept consistent, joint by joint, with the first state information of the corresponding joint of the demonstration object in the first piece of demonstration behavior data. This sets the initial state of the target object in the physical simulator, so that the simulator can subsequently simulate the target object learning the demonstrated actions corresponding to the second and subsequent pieces of demonstration behavior data in the segment.
And S403, inputting the first state information of each joint of the demonstration object in the first demonstration behavior data into a neural network model to be trained, and obtaining acting force data which is output by the neural network model and is used for controlling each joint of the target object.
S404, based on the initial state information of each joint of the target object simulated in the physical simulator, acting force is applied to each joint of the target object simulated in the physical simulator according to the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates one piece of simulated behavior data of the target object based on the set action behavior limiting characteristics.
It can be understood that, in the case that the initial state information of each joint of the target object in the physical simulator is determined, the force is applied to each joint of the target object, so that the state of each joint in the target object is changed once, and a piece of simulated behavior data is obtained, wherein the simulated behavior data includes the second state information of each joint of the target object.
It is understood that the simulated behavior data simulated in step S404 is the state information of each joint of the target object, simulated by applying the acting forces output by the neural network model with the joints in their initial states. This simulated behavior data therefore represents the action behavior learned by the target object in the physical simulator for the second piece of demonstration behavior data in the demonstration behavior data segment.
S405, detecting whether the total amount of the simulation behavior data meets a set condition, if so, confirming to obtain a simulation behavior data sequence containing at least one simulation behavior data, and executing step S408; if not, step S406 is performed.
The setting condition may be set as needed, for example, if the demonstration learning is performed for a set number of pieces of demonstration behavior data in the demonstration behavior data segment, the setting condition may be that the total number reaches the set number.
Optionally, the physical simulator may be required to simulate the target object learning the demonstrated actions corresponding to all the demonstration behavior data in the segment; the setting condition may then be that the total amount of simulated behavior data equals the amount of demonstration behavior data in the segment, or that it exceeds that amount. Note that if the initial state information of each joint of the target object in the physical simulator is itself counted as one piece of simulated behavior data simulated by the physical simulator, the setting condition may be that the total amount of simulated behavior data equals the amount of demonstration behavior data in the segment; if it is not so counted, the total amount of simulated behavior data need only equal the amount of demonstration behavior data in the segment minus one.
It is understood that, if the total number of the simulated behavior data satisfies the set condition, the simulated at least one simulated behavior data is determined as the simulated behavior data sequence. It will be appreciated that if the initial state information for each joint of the target object in the physical simulator is also determined to be one simulated behavior data simulated by the physical simulator, then the sequence of simulated behavior data should include at least two simulated behavior data.
And S406, inputting the simulation behavior data of the target object which is simulated by the physical simulator for the last time into the neural network model to obtain the updated acting force data of each joint of the target object.
S407, applying an acting force to each joint of the target object simulated in the physical simulator according to the updated acting force data of each joint of the target object, so that the physical simulator simulates the simulated behavior data of the target object based on the set action behavior limiting feature, and returning to step S405 until the total amount of the simulated behavior data satisfies the set condition.
In steps S406 and S407, the neural network model updates the acting force data required to be applied to each joint of the target object based on the simulation behavior data simulated by the physical simulator, and controls the physical simulator to continue simulating the motion of each joint of the target object until a plurality of simulation behavior data are obtained.
For example, suppose the demonstration behavior data segment includes 5 consecutive pieces of demonstration behavior data. After the initial state information of each joint of the target object in the physical simulator is set based on the first piece (giving the physical simulator its first piece of simulated behavior data), the simulator simulates the second piece of simulated behavior data, corresponding to the second piece of demonstration behavior data, through step S404. The third to fifth pieces of simulated behavior data, corresponding to the third to fifth pieces of demonstration behavior data, are then obtained by executing steps S406 and S407 three times, yielding a simulated behavior data sequence containing five pieces of simulated behavior data.
S408, determining the action behavior difference degree between the simulated target object and the demonstration object according to at least two demonstration behavior data in the demonstration behavior data segment and at least one simulation behavior data in the simulation behavior sequence.
It can be understood that, since the first demonstration behavior data in the segment is consistent with the initial state information of each joint of the target object in the physical simulator, it is only necessary to compare the demonstration behavior data following the first one with the simulated behavior data simulated after the initial state information was set.
Of course, if the initial state information of each joint of the target object in the physical simulator is also counted as one piece of simulated behavior data, the physical simulator outputs at least two pieces of simulated behavior data, and the demonstration behavior data and simulated behavior data can be compared pairwise according to their sequence order.
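One plausible realization of this comparison is a mean per-step distance over the aligned sequences; the L2 metric below is an illustrative assumption, since this application does not fix a specific distance measure:

```python
import numpy as np

def action_behavior_difference(demo_segment, sim_sequence):
    """Mean per-step distance between aligned demonstration and simulated data."""
    # Align by sequence order; when the initial state counts as the first
    # simulated datum, both sequences have equal length and compare one-to-one
    diffs = [np.linalg.norm(np.asarray(demo) - np.asarray(sim))
             for demo, sim in zip(demo_segment, sim_sequence)]
    return float(np.mean(diffs))
```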
S409, detecting whether the action behavior difference degree has reached a convergence state according to the currently determined difference degree and the difference degrees determined before the current moment; if not, executing step S410; if so, ending the training.
The convergence state can be understood as a convergence condition conventionally set in reinforcement learning, such as the aforementioned optimization target, which is not described in detail herein.
S410, determining an excitation signal according to the action behavior difference degree.
It will be appreciated that reinforcement learning trains an agent using a high-fidelity physics engine and reinforcement signals: during training, the agent continuously interacts with the physics engine using its current strategy, generating a series of reinforcement signals (i.e., excitation signals) that are used to update the strategy. In this embodiment, the strategy is expressed by the neural network model and the agent is the target object simulated in the physics engine, so the excitation signal for updating the strategy in the neural network model can be determined from the action behavior difference degree.
The larger the action behavior difference degree, the smaller the excitation signal; conversely, the smaller the difference degree, the larger the excitation signal.
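This inverse relationship can be given, for example, a negative-exponential shape, a common choice in motion-imitation reward design; the application does not prescribe this particular formula, so the sketch below is an assumption:

```python
import math

def excitation_signal(difference_degree, scale=1.0):
    """Larger difference degree -> smaller excitation signal, and vice versa."""
    return math.exp(-scale * difference_degree)
```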
S411, adjusting internal parameters of the neural network model according to the excitation signal to change the behavior control strategy expressed by the neural network model, and returning to step S401 to resample a demonstration behavior data segment.
It will be appreciated that the goal of continuously optimizing the neural network model is for the simulated target object to generate action behaviors as close as possible to the demonstration behavior data, and the optimization problem can be expressed as follows:
min‖τ − τ_E‖, subject to h(τ) ≤ 0 and g(τ) = 0;
where τ_E denotes the demonstration behavior data and τ denotes the simulated behavior data of the simulated target object obtained by the final optimization, the simulated behavior data including the second state information of each joint of the simulated target object. h(τ) ≤ 0 and g(τ) = 0 represent two ways of setting different action behavior limiting features: for example, h(τ) ≤ 0 may describe a behavior that must not occur in a certain case, while g(τ) = 0 may describe an action that can only be executed in a certain case.
It follows that the essence of the optimization problem is to generate optimized data, i.e., τ, that satisfies the constraints specified by the action behavior limiting features while remaining as similar as possible to the demonstration behavior data.
Correspondingly, the optimized τ to be generated is taken as the learning target, an excitation function is defined, and after the physical simulator performs a large number of simulations, the resulting excitation signals can be used to update the neural network model expressing the behavior control strategy.
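To make the two constraint styles concrete, the sketch below expresses them as trajectory checks in the carrying-object scenario discussed later; the state fields (`joint_torques`, `object_in_hands`) and the torque limit are hypothetical illustrations, not constraints fixed by this application:

```python
def h(tau, max_torque=150.0):
    # Inequality constraint h(tau) <= 0: no joint torque in the trajectory
    # may exceed the limit (a behavior that must not occur)
    return max(abs(t) for state in tau for t in state.joint_torques) - max_torque

def g(tau):
    # Equality constraint g(tau) == 0: the object must stay in the hands at
    # every step (a condition that must always hold in this case)
    return sum(0.0 if state.object_in_hands else 1.0 for state in tau)
```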
To facilitate an intuitive understanding of the learning method of the behavior control strategy of the present application, reference may be made to fig. 5, which shows a schematic diagram of the implementation framework of the method of the present application.
As can be seen from fig. 5, after a demonstration behavior data segment is sampled from the demonstration behavior data sequence, the demonstration behavior data is input into the neural network model, which outputs acting force data for controlling the corresponding joints of the target object simulated in the physical simulator. The physical simulator can then simulate the behavior of the target object based on the action behavior limiting features and output simulated behavior data of the simulated target object, the simulated behavior data including the simulated state information of the joints of the target object. By comparing the simulated behavior data with the sampled demonstration behavior data, the action behavior difference degree between the demonstration object and the target object can be determined, and the neural network model can be optimized based on this difference degree until convergence, so that the simulated behavior data output by the physical simulator approaches the corresponding demonstration behavior data while the action behavior it represents conforms to the action behavior limiting features.
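Tying the pieces together, the overall loop of fig. 5 can be condensed as follows, reusing the hypothetical helpers sketched above (`rollout`, `action_behavior_difference`, `excitation_signal`); the policy update is left as a generic callable because this application does not commit to one specific reinforcement learning algorithm:

```python
import random

def train(policy_net, sim, demo_sequence, update_policy,
          segment_len=5, max_iters=10000, tol=1e-3):
    """update_policy stands in for the RL update (e.g., a policy-gradient
    step) that adjusts the model's internal parameters from the reward."""
    for _ in range(max_iters):
        # Sample a demonstration behavior data segment as a training sample
        start = random.randint(0, len(demo_sequence) - segment_len)
        segment = demo_sequence[start:start + segment_len]
        # Roll the current strategy out in the physical simulator
        sim_sequence = rollout(policy_net, sim, segment)
        # Compare simulated and demonstration behavior data
        diff = action_behavior_difference(segment, sim_sequence)
        # Simplified stop check: a threshold stands in for the
        # history-based convergence detection of step S409
        if diff < tol:
            break
        reward = excitation_signal(diff)
        update_policy(policy_net, sim_sequence, reward)
    return policy_net
```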
To facilitate an understanding of the benefits of the present solution, reference is made below to an application scenario.
Taking the demonstration learning of a game character in a game application as an example, assume the game character is required to learn to walk while carrying an article, based on the walking motion demonstrated by a real user. In this case, the learning method of the behavior control policy of this embodiment may be as shown in fig. 6, where fig. 6 may be applied to a computer device, and the process may include:
S601, obtaining a demonstration data sequence of the walking motion demonstrated by a real user.
In this embodiment, demonstration learning is used to enable the game character in the game application to learn the behavior of a real user; therefore, the demonstration data sequence is data of the walking motion demonstrated by the real user. Specifically, the demonstration data sequence includes the first state values of the respective joints of the real user at a plurality of different times.
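One plausible in-memory layout for such a sequence is one entry per sampled moment, each holding first state values for every joint; the joint names, fields, and placeholder zero values below are purely illustrative:

```python
# Hypothetical layout: 120 sampled moments of the demonstrated walking motion,
# each with per-joint first state values (angle and angular velocity).
demo_sequence = [
    {
        "time_step": t,
        "joints": {
            "hip": {"angle": 0.0, "angular_velocity": 0.0},
            "knee": {"angle": 0.0, "angular_velocity": 0.0},
        },
    }
    for t in range(120)
]
```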
It is understood that this embodiment takes the game character learning the walking motion of a real user as an example; if the game character needs to learn another motion, it is only necessary to obtain a demonstration data sequence of the corresponding motion, demonstrated by a real user or another demonstration object having the same joints as the game character. For example, if the game character needs to learn a heel-turn, the demonstration data sequence is replaced with a sequence of the real user demonstrating the heel-turn.
S602, randomly sampling a demonstration data segment from the demonstration data sequence.
S603, setting initial state information of each joint of the game character in the physical simulator according to the first demonstration data in the demonstration data segment, and obtaining the first simulated behavior data of the game character in the physical simulator.
Steps S602 and S603 are still described by taking one way of sampling training samples as an example; other sampling ways are also applicable to this embodiment.
S604, inputting the first demonstration data into the neural network model to be trained, and obtaining the acting force data, output by the neural network model, for each joint of the game character to be simulated.
S605, controlling the motion of each joint of the game character simulated in the physical simulator according to the acting force data of each joint output by the neural network model, so that the physical simulator, based on the set feature of walking while carrying an article, simulates the second state information of each joint of the game character while walking with the article, obtaining the second simulated behavior data of the simulated game character.
It can be understood that, since the game character is required to learn the action of walking while carrying an article by extending the walking motion demonstrated by the real user, the action behavior limiting feature configured in the physical simulator is that the target object walks while carrying an article (such as a box). Correspondingly, the physical simulator can simulate the process of the game character walking with the article according to the joint acting force data input from the neural network model, and output the resulting simulated behavior data, which includes the second state information of each joint of the game character.
It will be appreciated that if other walking-related action behaviors are to be extended from the real user's walking motion, the action behavior limiting features configured in the physical simulator may differ. For instance, if the game character is required to learn the skill of walking with constantly changing postures based on the real user's normal walking motion, the limiting feature configured in the physical simulator may be that the walking postures of the game character at adjacent moments differ. Of course, this embodiment takes a walking-action learning scene as an example; if other actions are to be learned, the limiting feature in the physical simulator may be set according to the action behavior demonstrated by the demonstration object and the specific action behavior the game character needs to extend.
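Expressed as per-step checks, the two features just mentioned might look as follows; the state fields (`box_in_hands`, `posture`) are hypothetical, since this application does not fix a simulator API:

```python
def carrying_article_feature(state):
    # The game character must keep the article (e.g., a box) held while walking
    return state.box_in_hands

def changing_posture_feature(previous_state, state):
    # Walking postures at adjacent moments must differ
    return state.posture != previous_state.posture
```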
S606, inputting the simulated behavior data of the game character most recently simulated by the physical simulator into the neural network model to obtain updated acting force data for each joint of the game character; applying the updated forces to each joint of the game character simulated in the physical simulator, so that the physical simulator simulates further simulated behavior data of the game character based on the set action behavior limiting features; and repeating step S606 until the total amount of simulated behavior data is consistent with the total amount of demonstration behavior data in the demonstration data segment.
For step S606, reference can be made to the related description of the previous embodiment, which is not repeated here.
S607, determining the action behavior difference degree between the simulated game character and the real user according to each demonstration behavior data in the demonstration data segment and each simulated behavior data in the simulated behavior data sequence.
S608, detecting whether the currently determined action behavior difference degree has reached the minimum value according to the difference degrees determined before the current moment; if not, executing step S609; if so, ending the training.
This embodiment takes minimizing the action behavior difference degree as the optimization target, but it is also applicable where the optimization target is different.
S609, determining an excitation signal according to the action behavior difference degree.
S610, adjusting internal parameters of the neural network model according to the excitation signal to change the behavior control strategy expressed by the neural network model, and returning to step S602 to resample a demonstration data segment as a training sample.
It is understood that once the optimization target is confirmed to be reached, the neural network model is trained. On this basis, the action behavior of the game character in the game application can be controlled based on the control strategy expressed by the neural network model, so that the game character exhibits the learned behavior of walking while carrying an article.
Specifically, the trained neural network model may be loaded into a game application in which the game character carries an article. In this case, the game application may acquire the state information of each joint of the game character and input it into the neural network model; the game application may then control the movement of each joint of the game character based on the acting force data of each joint output by the neural network model, so that the game character produces the action of walking with the article.
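At runtime this amounts to a simple closed control loop; `game` below is a hypothetical stand-in for the game application's character interface, not an API defined by this application:

```python
import torch

def control_character(policy_net, game):
    policy_net.eval()
    with torch.no_grad():
        while game.is_running():
            # Acquire the current state information of each joint
            joint_states = game.get_character_joint_states()
            forces = policy_net(torch.as_tensor(joint_states,
                                                dtype=torch.float32))
            # Drive each joint with the model's force output so the character
            # walks while carrying the article
            game.apply_character_joint_forces(forces.numpy())
```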
According to this embodiment, the neural network model corresponding to the behavior control strategy for controlling the game character to walk with an article can be trained based on the walking motion of a real user. The game character in the game application can then be motion-controlled based on this neural network model, so that it learns walking actions similar to those demonstrated by the real user while extending them with the motor skill of walking while carrying an article.
Tests show that the solution of the present application enables the game character to acquire a well-optimized article-carrying walking action and to walk stably for a long time, achieving an effect that existing solutions cannot reach.
The application also provides a learning device of the behavior control strategy, which corresponds to the learning method of the behavior control strategy.
As shown in fig. 7, which shows a schematic structural diagram of an embodiment of the learning apparatus for behavior control policy according to the present application, the apparatus of the present embodiment may include:
the data sampling unit 701 is configured to sample a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, where the demonstration behavior data segment includes at least two demonstration behavior data having a sequence, and the demonstration behavior data includes first state information of each joint of a demonstration object;
a model control unit 702, configured to set initial state information of each joint of a target object simulated in a physical simulator according to the demonstration behavior data segment, and determine acting force data acting on each joint of the target object by using a neural network model to be trained, where the target object and the demonstration object have the same joint;
a data simulation unit 703, configured to control, based on the acting force data of each joint of the target object determined by the neural network model, the motion of each joint of the target object simulated in the physical simulator, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, where the simulated behavior data sequence includes at least one piece of simulated behavior data in a sequential order, the simulated behavior data includes second state information of each joint of the target object, and the action behavior limiting feature is used to limit a feature that needs to be satisfied by the action behavior of the simulated target object;
a difference comparison unit 704, configured to determine a degree of difference in action behavior between the simulated target object and the demonstration object according to first state information of each joint of the demonstration object in the demonstration behavior data and second state information of each joint of the target object in the simulation behavior data;
and a training optimization unit 705, configured to optimize the behavior control strategy expressed by the neural network model based on the action behavior difference degree until an optimization target is reached, and determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.
In one possible implementation manner, the training optimization unit includes:
the detection subunit is used for detecting whether the action behavior difference degree reaches a set optimization target;
the cyclic training subunit is used for optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree if the action behavior difference degree does not reach the set optimization target, and returning to execute the operation of the data sampling unit;
and the ending control subunit is used for confirming that the learning is finished and determining the behavior control strategy expressed by the neural network model as a control strategy according to which the demonstration learning is performed if the action behavior difference reaches a set optimization target.
Optionally, when the training optimization unit or the cyclic training subunit optimizes the behavior control strategy expressed by the neural network model based on the action behavior difference, the training optimization unit or the cyclic training subunit is specifically configured to determine an excitation signal based on the reinforcement learning algorithm according to the action behavior difference; and adjusting internal parameters in the neural network model according to the excitation signal so as to change the behavior control strategy expressed by the neural network model.
In one possible implementation, the model control unit includes:
the simulation initial setting unit is used for setting initial state information of each joint of the simulated target object in the physical simulator according to the first state information of each joint of the demonstration object in the first demonstration behavior data in the demonstration behavior data segment;
and the initial force determining unit is used for inputting the first state information of each joint of the demonstration object in the first demonstration behavior data into a neural network model to be trained to obtain acting force data which is output by the neural network model and is used for controlling each joint of the target object.
In another possible implementation manner, the data simulation unit includes:
the simulation control unit is used for applying acting force to each joint of the target object simulated in the physical simulator according to the acting force data of each joint of the target object determined by the neural network model based on the initial state information of each joint of the target object simulated in the physical simulator, so that the physical simulator simulates one piece of simulated behavior data of the target object based on the set action behavior limiting characteristics;
the simulation ending control unit is used for confirming to obtain a simulation behavior data sequence containing at least one simulation behavior data if the total amount of the simulation behavior data meets a set condition;
and the simulation circulating unit is used for inputting the simulation behavior data of the target object which is simulated by the physical simulator for the last time into the neural network model to obtain updated acting force data of each joint of the target object if the total amount of the simulation behavior data does not meet a set condition, and applying acting force to each joint of the target object simulated in the physical simulator according to the updated acting force data of each joint of the target object, so that the physical simulator simulates the simulation behavior data of the target object based on the set action behavior limiting characteristics until the total amount of the simulated simulation behavior data meets the set condition.
Optionally, the apparatus may further include:
and the model application unit is used for loading the neural network model into a target application program after the training optimization unit obtains the behavior control strategy expressed by the neural network model so as to control the action behavior of a target object controlled by the target application program through the behavior control strategy expressed by the neural network model, and the target application program is used for controlling the running of the target object.
In another aspect, the present application further provides a storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are loaded and executed by a processor, the learning method of the behavior control policy as in any one of the above embodiments is implemented.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (10)

1. A method for learning a behavior control strategy, comprising:
sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two demonstration behavior data with a sequence, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
setting initial state information of each joint of a simulated target object in a physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained, wherein the target object and the demonstration object have the same joint;
controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, wherein the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting features which are required to be additionally met in the process of action behavior of the simulated target object;
determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
2. The method for learning a behavior control strategy according to claim 1, wherein the optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization goal is reached comprises:
detecting whether the action behavior difference degree reaches a set optimization target or not;
if the action behavior difference does not reach the set optimization target, optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference, and returning to execute the operation of sampling the demonstration behavior data segment serving as the training sample from the demonstration behavior data sequence;
and if the action behavior difference degree reaches a set optimization target, confirming that the learning is finished.
3. The method for learning a behavior control strategy according to claim 1 or 2, wherein the optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree comprises:
determining an excitation signal according to the action and behavior difference degree and based on a reinforcement learning algorithm;
and adjusting internal parameters in the neural network model according to the excitation signal so as to change the behavior control strategy expressed by the neural network model.
4. The method for learning a behavior control strategy according to claim 1, wherein the setting initial state information of each joint of a target object simulated in a physical simulator according to the demonstration behavior data segment, and determining the acting force data acting on each joint of the target object by using a neural network model to be trained comprises:
setting initial state information of each joint of a simulated target object in a physical simulator according to first state information of each joint of the demonstration object in the first demonstration behavior data in the demonstration behavior data segment;
and inputting the first state information of each joint of the demonstration object in the first demonstration behavior data into a neural network model to be trained to obtain acting force data which is output by the neural network model and is used for controlling each joint of the target object.
5. The learning method of the behavior control strategy according to claim 1 or 4, wherein the step of controlling the motion of each joint of the target object simulated in the physical simulator based on the force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on the set action behavior defining characteristics comprises the steps of:
based on the initial state information of each joint of the target object simulated in the physical simulator, applying acting force to each joint of the target object simulated in the physical simulator according to the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates one piece of simulated behavior data of the target object based on the set action behavior limiting characteristics;
if the total amount of the simulation behavior data meets a set condition, confirming to obtain a simulation behavior data sequence containing at least one simulation behavior data;
if the total amount of the simulated behavior data does not meet a set condition, inputting the simulated behavior data of the target object, which is simulated by the physical simulator for the last time, into the neural network model to obtain updated acting force data of each joint of the target object, and applying acting force to each joint of the target object simulated in the physical simulator according to the updated acting force data of each joint of the target object, so that the physical simulator simulates the simulated behavior data of the target object based on the set action behavior limiting characteristics until the total amount of the simulated behavior data meets the set condition.
6. The method for learning a behavior control strategy according to claim 1, further comprising, after obtaining the behavior control strategy expressed by the neural network model:
and loading the neural network model into a target application program so as to control the action behavior of a target object controlled by the target application program through a behavior control strategy expressed by the neural network model, wherein the target application program is used for controlling the running of the target object.
7. An apparatus for learning a behavior control strategy, comprising:
the data sampling unit is used for sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two pieces of demonstration behavior data with a sequence, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
the model control unit is used for setting initial state information of each joint of a simulated target object in the physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a to-be-trained neural network model, wherein the target object and the demonstration object have the same joint;
the data simulation unit is used for controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting characteristic, the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting characteristic is used for limiting a characteristic which is required to be additionally met in the process of action behavior of the simulated target object;
the difference comparison unit is used for determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and the training optimization unit is used for optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
8. The behavior control strategy learning device of claim 7, wherein the training optimization unit comprises:
the detection subunit is used for detecting whether the action behavior difference degree reaches a set optimization target;
the cyclic training subunit is used for optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree if the action behavior difference degree does not reach the set optimization target, and returning to execute the operation of the data sampling unit;
and the ending control subunit is used for confirming that the learning is finished and determining the behavior control strategy expressed by the neural network model as a control strategy according to which the demonstration learning is performed if the action behavior difference reaches a set optimization target.
9. A computer device, comprising:
a processor and a memory;
the processor is used for calling and executing the program stored in the memory;
the memory is configured to store the program, the program at least to:
sampling a demonstration behavior data segment serving as a training sample from a demonstration behavior data sequence, wherein the demonstration behavior data segment comprises at least two demonstration behavior data with a sequence, and the demonstration behavior data comprises first state information of each joint of a demonstration object;
setting initial state information of each joint of a simulated target object in a physical simulator according to the demonstration behavior data segment, and determining acting force data acting on each joint of the target object by using a neural network model to be trained, wherein the target object and the demonstration object have the same joint;
controlling the motion of each joint of the simulated target object in the physical simulator based on the acting force data of each joint of the target object determined by the neural network model, so that the physical simulator simulates a simulated behavior data sequence of the target object based on a set action behavior limiting feature, wherein the simulated behavior data sequence comprises at least one piece of simulated behavior data with a sequence order, the simulated behavior data comprises second state information of each joint of the target object, and the action behavior limiting feature is used for limiting features which are required to be additionally met in the process of action behavior of the simulated target object;
determining the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulation behavior data;
and optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference until an optimization target is reached, and determining the behavior control strategy expressed by the neural network model as a control strategy according to which demonstration learning is carried out.
10. A storage medium having stored thereon computer-executable instructions which, when loaded and executed by a processor, implement a method of learning a behaviour control strategy according to any one of claims 1 to 6.
CN201910820695.0A 2019-08-29 2019-08-29 Behavior control strategy learning method, device, equipment and storage medium Active CN110516389B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910820695.0A CN110516389B (en) 2019-08-29 2019-08-29 Behavior control strategy learning method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110516389A CN110516389A (en) 2019-11-29
CN110516389B (en) 2021-04-13

Family

ID=68630130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910820695.0A Active CN110516389B (en) 2019-08-29 2019-08-29 Behavior control strategy learning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110516389B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111223170B (en) * 2020-01-07 2022-06-10 腾讯科技(深圳)有限公司 Animation generation method and device, electronic equipment and storage medium
CN111292401B (en) 2020-01-15 2022-05-03 腾讯科技(深圳)有限公司 Animation processing method and device, computer storage medium and electronic equipment
CN111260762B (en) 2020-01-19 2023-03-28 腾讯科技(深圳)有限公司 Animation implementation method and device, electronic equipment and storage medium
CN111340211B (en) * 2020-02-19 2020-11-24 腾讯科技(深圳)有限公司 Training method of action control model, related device and storage medium
CN111832187A (en) * 2020-07-24 2020-10-27 宁夏政安信息科技有限公司 Realization method for simulating and demonstrating secret stealing means

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441776A (en) * 2008-12-04 2009-05-27 浙江大学 Three-dimensional human body motion editing method driven by demonstration show based on speedup sensor
CN105637540A (en) * 2013-10-08 2016-06-01 谷歌公司 Methods and apparatus for reinforcement learning
CN106056213A (en) * 2015-04-06 2016-10-26 谷歌公司 Selecting reinforcement learning actions using goals and observations
CN109291052A (en) * 2018-10-26 2019-02-01 山东师范大学 A kind of massaging manipulator training method based on deeply study
CN109345614A (en) * 2018-09-20 2019-02-15 山东师范大学 The animation simulation method of AR augmented reality large-size screen monitors interaction based on deeply study

Also Published As

Publication number Publication date
CN110516389A (en) 2019-11-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant