CN110516389A

CN110516389A - Learning method, apparatus, device and storage medium for a behaviour control strategy

Info

Publication number: CN110516389A
Application number: CN201910820695.0A
Authority: CN
Inventors: 孙明飞; 石贝; 付强
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-08-29
Filing date: 2019-08-29
Publication date: 2019-11-29
Anticipated expiration: 2039-08-29
Also published as: CN110516389B

Abstract

This application discloses a learning method, apparatus, computer device and storage medium for a behaviour control strategy. The method comprises: sampling, from a demonstration behavioral data sequence, a demonstration behavioral data segment that includes at least two items of demonstration behavioral data; setting, according to the demonstration behavioral data segment, the initial state information of each joint of a target object simulated in a physics simulator, and determining the force data for each joint of the target object using a neural network model to be trained; controlling the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence of the target object based on a set action behavior limiting feature; determining an action behavior difference according to the demonstration behavioral data and the simulated behavioral data; and optimizing the neural network model based on the action behavior difference until an optimization goal is reached. The scheme of this application helps the object of demonstration learning generate extended action behaviors based on the demonstrated actions.

Description

Learning method, apparatus, device and storage medium for a behaviour control strategy

Technical field

This application relates to the field of computer technology, and more particularly to a learning method, apparatus, device and storage medium for a behaviour control strategy.

Background technique

Demonstration learning is an autonomous learning technique that takes demonstrated behavior as its target. In demonstration learning, the object that is to learn a skill is required to imitate the demonstrated behavior, so that the object acquires the motor skill corresponding to that behavior. The object that is to learn the skill differs between application fields. For example, in the field of games it may be a character or an animal in a game; in the field of robot control it may be a robot.

Currently, in demonstration learning, a behaviour control strategy can be learned from several groups of demonstration examples by a variety of machine learning algorithms. The object in the actual application environment can then be controlled based on this behaviour control strategy, so that the object acquires the action behavior corresponding to the demonstration examples.

However, in existing demonstration learning, if the object that is to learn a skill is expected to have a certain motor skill, the motion demonstration data corresponding to that skill must be obtained in advance. Without the corresponding demonstration data, the object cannot acquire the skill, which makes it more complex for the object to produce that skill. For example, if a character in a game is expected to have the motor skill of walking while carrying a box, demonstration data of a real person walking while carrying a box must be obtained in advance.

Summary of the invention

In view of this, this application provides learning method, device, equipment and the storage medium of a kind of behaviour control strategy, It may learn action behavior different from demostrating action to be conducive to the object of demonstration study, reduce the uncertain plant learning behavior skill The complexity of energy.

To achieve the above object, in one aspect, this application provides a learning method for a behaviour control strategy, comprising:

sampling, from a demonstration behavioral data sequence, a demonstration behavioral data segment as a training sample, where the demonstration behavioral data segment includes at least two items of demonstration behavioral data in sequential order, and each item of demonstration behavioral data includes first state information of each joint of a demonstration object;

setting, according to the demonstration behavioral data segment, initial state information of each joint of a target object simulated in a physics simulator, and determining, using a neural network model to be trained, force data acting on each joint of the target object, where the target object and the demonstration object have the same joints;

controlling, based on the force data of each joint of the target object determined by the neural network model, the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence of the target object based on a set action behavior limiting feature, where the simulated behavioral data sequence includes at least one item of simulated behavioral data in sequential order, each item of simulated behavioral data includes second state information of each joint of the target object, and the action behavior limiting feature defines the features that the action behavior of the simulated target object must satisfy;

determining an action behavior difference between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavioral data and the second state information of each joint of the target object in the simulated behavioral data;

optimizing, based on the action behavior difference, the behaviour control strategy expressed by the neural network model until an optimization goal is reached, and determining the behaviour control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.
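The difference-determining step can be illustrated with a minimal sketch. The summary above does not fix a particular metric, so the mean squared difference used here, the function name `behaviour_difference`, and the representation of each frame as a flat list of joint state values are all assumptions made for illustration only.

```python
from typing import Sequence


def behaviour_difference(
    demo_frames: Sequence[Sequence[float]],
    sim_frames: Sequence[Sequence[float]],
) -> float:
    """Mean squared difference between paired demonstration and simulated frames.

    Each frame is a flat list of joint state values (first state information
    for the demonstration, second state information for the simulation).
    The metric itself is an assumption, not taken from the source.
    """
    if not demo_frames or len(demo_frames) != len(sim_frames):
        raise ValueError("need the same non-zero number of frames on both sides")
    total = 0.0
    count = 0
    for demo, sim in zip(demo_frames, sim_frames):
        if len(demo) != len(sim):
            raise ValueError("paired frames must describe the same joints")
        for d, s in zip(demo, sim):
            total += (d - s) ** 2
            count += 1
    if count == 0:
        raise ValueError("frames must contain at least one state value")
    return total / count


# Identical behavior gives zero difference; diverging behavior gives a positive value.
same = behaviour_difference([[0.1, 0.2]], [[0.1, 0.2]])
apart = behaviour_difference([[0.0, 1.0]], [[0.0, 3.0]])
```

A smaller value means the simulated target object follows the demonstration more closely; an optimizer would adjust the neural network model to reduce it until the optimization goal is reached.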

In another aspect, this application further provides a learning apparatus for a behaviour control strategy, comprising:

a data sampling unit, configured to sample, from a demonstration behavioral data sequence, a demonstration behavioral data segment as a training sample, where the demonstration behavioral data segment includes at least two items of demonstration behavioral data in sequential order, and each item of demonstration behavioral data includes first state information of each joint of a demonstration object;

a model control unit, configured to set, according to the demonstration behavioral data segment, initial state information of each joint of a target object simulated in a physics simulator, and to determine, using a neural network model to be trained, force data acting on each joint of the target object, where the target object and the demonstration object have the same joints;

a data simulation unit, configured to control, based on the force data of each joint of the target object determined by the neural network model, the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence of the target object based on a set action behavior limiting feature, where the simulated behavioral data sequence includes at least one item of simulated behavioral data in sequential order, each item of simulated behavioral data includes second state information of each joint of the target object, and the action behavior limiting feature defines the features that the action behavior of the simulated target object must satisfy;

a difference comparing unit, configured to determine an action behavior difference between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavioral data and the second state information of each joint of the target object in the simulated behavioral data;

a training optimization unit, configured to optimize, based on the action behavior difference, the behaviour control strategy expressed by the neural network model until an optimization goal is reached, and to determine the behaviour control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

In another aspect, this application further provides a computer device, comprising:

a processor and a memory;

the processor being configured to call and execute the program stored in the memory;

the memory being configured to store the program, the program being at least used for:

controlling, based on the force data of each joint of the target object determined by the neural network model, the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence of the target object based on a set action behavior limiting feature, where the simulated behavioral data sequence includes at least one item of simulated behavioral data in sequential order, each item of simulated behavioral data includes second state information of each joint of the target object, and the action behavior limiting feature defines the features that the action behavior of the simulated target object must satisfy; and determining an action behavior difference between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavioral data and the second state information of each joint of the target object in the simulated behavioral data.

In another aspect, this application further provides a storage medium storing computer-executable instructions which, when loaded and executed by a processor, implement the learning method for a behaviour control strategy according to any one of the above.

As can be seen from the above technical solution, in this application the behaviour control strategy required for demonstration learning is expressed by a neural network model, and the strategy is trained through the cooperation of the neural network model and a physics simulator. During training, in addition to the demonstration behavioral data, an action behavior limiting feature is set in the physics simulator for the object that is to learn the behavior skill. The action behavior limiting feature defines the requirements that the behavior of the target object simulated in the physics simulator must satisfy, so that the behaviour control strategy expressed by the trained neural network model makes the target object produce action behaviors that are as similar as possible to the demonstration behavioral data while also satisfying the set limiting feature. Therefore, when the trained neural network model controls the action learning of the target object, the target object can learn action behaviors that are similar to, but not exactly the same as, the action behavior corresponding to the demonstration behavioral data; that is, other similar action behaviors can be extended from the demonstration. The target object can thus learn, from the demonstration behavioral data, action behaviors different from the demonstrated behavior, so that a behaviour control strategy for a certain action behavior can be obtained even without demonstration behavioral data for that behavior. The target object can then be controlled, based on this behaviour control strategy, to learn action behaviors that are similar to but different from the demonstrated behavior, which helps reduce the complexity of demonstration learning.

Brief description of the drawings

To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only embodiments of this application; a person of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1a is a schematic diagram of the joints of a demonstration object and their states in demonstration learning;

Fig. 1b is a schematic structural diagram of the joints of the object that is to learn a skill in demonstration learning;

Fig. 2 is a schematic structural diagram of a computer device to which the learning method for a behaviour control strategy of this application is applicable;

Fig. 3 is a schematic flowchart of one embodiment of the learning method for a behaviour control strategy of this application;

Fig. 4 is a schematic flowchart of another embodiment of the learning method for a behaviour control strategy of this application;

Fig. 5 is a schematic architecture diagram of one implementation principle of the learning method for a behaviour control strategy of this application;

Fig. 6 is a schematic flowchart of the learning method for a behaviour control strategy of this application as applied to one application scenario;

Fig. 7 is a schematic structural diagram of one embodiment of the learning apparatus for a behaviour control strategy of this application.

Specific embodiment

The scheme of this application is applicable to demonstration learning, which involves a demonstration object and an object that is to learn a behavior skill. The demonstration object demonstrates the behavior, from which the demonstration behavioral data on which demonstration learning is based are generated. The object that is to learn the behavior skill is the object that finally learns the corresponding action behavior skill from the demonstration behavioral data. For example, the object may be a robot or a game object in a game.

Taking the field of games as an example, the object that is to learn a skill may be a game character. In this case, demonstration behavioral data can be obtained from actions demonstrated by a real user (such as walking or jumping), and reinforcement learning is performed on the game character according to the demonstration behavioral data, so that the game character acquires the skill of the demonstrated action (such as walking or jumping).

Currently, in demonstration learning, the demonstration behavioral data are generally expressed as the state of each joint of the demonstration object; the state may include the angle and speed of each joint (including the speed in each reference direction). The object that is to learn a skill has the same joints as the demonstration object, and the corresponding joints have the same degrees of freedom.

Fig. 1a shows the joints of a demonstration object and the state of each joint.

Fig. 1a takes a person as the demonstration object and shows the joints of the human body that the demonstration object contains, for example the knee, elbow and wrist joints.

The demonstration behavioral data also reflect the state of each joint of the demonstration object in Fig. 1a, for example the angle and speed of each joint in three-dimensional space. For example, in a set three-dimensional space with mutually orthogonal X, Y and Z axes, the angle of each joint of the demonstration object relative to these three axes can be obtained from the demonstration behavioral data of the demonstration object.
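One way to picture this representation is the minimal sketch below. The class name `JointState`, the choice of radians, and the sorted-by-name flattening order are assumptions introduced for illustration; the source only states that each joint has an angle and a speed relative to the X, Y and Z axes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class JointState:
    """State of one joint relative to the X, Y and Z axes."""
    angle: Vec3     # angle about each axis (radians assumed)
    velocity: Vec3  # speed component along each reference direction


# One item of demonstration behavioral data: the state of every joint at one moment.
Frame = Dict[str, JointState]


def flatten(frame: Frame) -> List[float]:
    """Turn a frame into a fixed-order feature vector (joints sorted by name)."""
    values: List[float] = []
    for name in sorted(frame):
        values.extend(frame[name].angle)
        values.extend(frame[name].velocity)
    return values


frame: Frame = {
    "knee": JointState(angle=(0.1, 0.0, 0.0), velocity=(0.5, 0.0, 0.0)),
    "elbow": JointState(angle=(0.0, 0.2, 0.0), velocity=(0.0, -0.3, 0.0)),
}
features = flatten(frame)  # 6 values per joint, "elbow" first
```

Because the target object and the demonstration object share the same joints and degrees of freedom, the same structure could describe either of them.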

Correspondingly, for the target object that is to learn a skill to learn the corresponding skill from the demonstration behavioral data of the demonstration object, the target object should have the same joints as the demonstration object, and each joint should have the same degrees of freedom. Fig. 1b is a schematic structural diagram of a target object that performs demonstration learning based on the demonstration behavioral data of the demonstration object shown in Fig. 1a. The target object is likewise a human body, and as can be seen from Fig. 1b, the joints and degrees of freedom of the target object and the demonstration object are the same.

It can be understood that Fig. 1a and Fig. 1b merely take a human form for both the demonstration object and the target object that is to learn the behavior skill as an example. In practical applications, if the target object takes another form, the demonstration object needs to have the same joints as the target object. For example, if the target object is an animal-shaped robot (such as Doraemon), the demonstration object may be an animal (such as a cat).

It can be understood that, in demonstration learning, a behaviour control strategy for controlling the target object needs to be determined based on the demonstration behavioral data, and the motion of the target object is then controlled based on that strategy, so that the target object learns behavior skills similar to the demonstrated behavior.

However, the inventors found through research that the behaviour control strategy determined in existing demonstration learning only enables the target object to learn behavior that is almost identical to the demonstrated behavior; the target object cannot learn other, extended behaviors that are similar to the demonstrated behavior. This limits the behavior skills that can be learned through demonstration learning: a behavior can be learned only when demonstration data for that behavior exist, which makes demonstration learning more complex and less flexible.

Based on the above findings, the scheme of this application can train, from demonstration behavioral data, a behaviour control strategy suited to extending the demonstrated behavior.

The scheme of this application is applicable to a computer device, which may be a personal computer, a server or another electronic device with data processing capability.

For example, Fig. 2 shows a schematic structural diagram of a computer device to which the learning method for a behaviour control strategy of the embodiments of this application is applicable. In Fig. 2, the computer device 200 may include a processor 201, a memory 202, a communication interface 203, an input unit 204, a display 205 and a communication bus 206.

The processor 201, the memory 202, the communication interface 203, the input unit 204 and the display 205 communicate with one another through the communication bus 206.

In the embodiments of this application, the processor 201 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a field-programmable gate array (FPGA) or another programmable logic device.

The processor can call the program stored in the memory 202; specifically, the processor can perform the operations performed by the computer device in Fig. 3 to Fig. 6 below.

The memory 202 stores one or more programs. A program may include program code, and the program code includes computer operation instructions. In the embodiments of this application, the memory stores at least a program for implementing the following functions:

sampling, from a demonstration behavioral data sequence, a demonstration behavioral data segment as a training sample, where the demonstration behavioral data segment includes at least two items of demonstration behavioral data in sequential order, and each item of demonstration behavioral data includes first state information of each joint of a demonstration object;

setting, according to the demonstration behavioral data segment, initial state information of each joint of a target object simulated in a physics simulator, and determining, using a neural network model to be trained, force data acting on each joint of the target object, where the target object and the demonstration object have the same joints;

controlling, based on the force data of each joint of the target object determined by the neural network model, the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence of the target object based on a set action behavior limiting feature, where the simulated behavioral data sequence includes at least one item of simulated behavioral data in sequential order, each item of simulated behavioral data includes second state information of each joint of the target object, and the action behavior limiting feature defines the features that the action behavior of the simulated target object must satisfy; and determining an action behavior difference between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavioral data and the second state information of each joint of the target object in the simulated behavioral data;

optimizing, based on the action behavior difference, the behaviour control strategy expressed by the neural network model until an optimization goal is reached, and determining the behaviour control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

In one possible implementation, the memory 202 may include a program storage area and a data storage area. The program storage area may store an operating system, the above-mentioned program, and application programs required by at least one function (such as a sound playback function, an image playback function or a positioning function). The data storage area may store data created during use of the computer device, such as audio data or a phone book.

In addition, the memory 202 may include high-speed random access memory and may also include non-volatile memory.

The communication interface 203 may be an interface of a communication module, such as an interface of a GSM module.

The computer device may also include the input unit 204, which may include a touch sensing unit, a keyboard and the like.

The display 205 includes a display panel, such as a touch display panel.

Of course, the structure shown in Fig. 2 does not limit the computer device in the embodiments of this application. In practical applications, the computer device may include more or fewer components than shown in Fig. 2, or may combine certain components.

The learning method for a behaviour control strategy of this application is introduced below with reference to flowcharts.

Fig. 3 shows a schematic flowchart of a learning method for a behaviour control strategy of this application. The scheme of this embodiment can be applied to the above computer device. The method includes:

S301: sample a demonstration behavioral data segment as a training sample from a demonstration behavioral data sequence.

The demonstration behavioral data sequence includes demonstration behavioral data for multiple consecutive moments. A demonstration behavioral data segment is a continuous portion of the demonstration behavioral data sequence; accordingly, it includes at least two items of demonstration behavioral data in sequential order, that is, the demonstration behavioral data of at least two adjacent moments. Each item of demonstration behavioral data includes the state information of each joint of the demonstration object.

The state information of a joint characterizes the specific state of that joint and reflects its motion state; the state information of all joints together reflects the action behavior of the demonstration object. For example, the state information of a joint includes one or more state values such as the angle and speed of the joint. To distinguish it from the state information of each joint in the subsequent simulation, the state information of the joints of the demonstration object is called first state information.

It can be understood that the demonstration behavioral data sequence can be obtained in many ways. In one possible implementation, after the demonstration object demonstrates an action behavior, the demonstration data captured by a motion capture device can be used directly as the demonstration behavioral data sequence, or the demonstration data can be processed to obtain the sequence. Obtaining the demonstration behavioral data sequence in other ways is equally applicable to this embodiment, which is not limited here.

It can be understood that sampling the demonstration behavioral data sequence yields the samples required for training the behaviour control strategy. The demonstration behavioral data segment used as a training sample can be sampled in many ways. For example, one segment can be sampled at random from the demonstration behavioral data sequence each time as the training sample. Alternatively, multiple demonstration behavioral data segments can be sampled from the sequence at once, with only one segment used to train the neural network in each training cycle.
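The random sampling option described above can be sketched as follows. The function name `sample_segment`, the fixed segment length and the uniform choice of start index are assumptions for illustration; the source only requires that a segment be a continuous portion of the sequence with at least two items.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_segment(sequence: Sequence[T], length: int, rng: random.Random) -> List[T]:
    """Return `length` consecutive items starting at a uniformly random position."""
    if length < 2:
        raise ValueError("a segment needs at least two items in sequential order")
    if length > len(sequence):
        raise ValueError("segment cannot be longer than the sequence")
    start = rng.randrange(len(sequence) - length + 1)
    return list(sequence[start:start + length])


# Integers stand in for per-moment demonstration behavioral data.
demo_sequence = list(range(100))
rng = random.Random(0)
segment = sample_segment(demo_sequence, length=5, rng=rng)
```

Sampling several segments up front and consuming one per training cycle would simply call the same function repeatedly and store the results.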

S302: according to the demonstration behavioral data segment, set the initial state information of each joint of the target object simulated in the physics simulator, and determine, using the neural network model to be trained, the force data acting on each joint of the target object.

The physics simulator, also called a physics engine, is a simulation program for simulating the motion of an agent.

In the embodiments of this application, the agent simulated in the physics simulator is the target object, and the physics simulator can simulate the forces on and the motion of the target object in a real physical space.

The target object is the object that is to learn the behavior skill. Taking a game application as an example, the target object may be a game object such as a game character in the game application. As described above, the target object and the demonstration object have the same joints.

In the embodiments of this application, the behaviour control strategy is expressed by a neural network model; therefore, training the neural network model yields the behaviour control strategy for controlling each joint of the target object. The behaviour control strategy can be characterized by the forces on each joint of the target object output by the neural network model.

It can be understood that, for the target object simulated in the physics simulator to learn the demonstrated behavior corresponding to the demonstration behavioral data, the initial state information of each joint of the target object in the physics simulator must first be set based on the demonstration behavioral data in the segment, so that the initial action behavior of the simulated target object is consistent with the first or an intermediate action behavior of the demonstration object in the demonstration behavioral data segment.

Optionally, so that the physics simulator can simulate the target object learning each demonstrated behavior in the demonstration behavioral data segment, the initial state information of each joint of the simulated target object can be set according to the first state information of each joint of the demonstration object in the first item of demonstration behavioral data in the segment. In this case, the state information of each joint of the target object in the physics simulator is consistent with the first state information of the corresponding joint of the demonstration object in that first item.

Correspondingly, the first state information of each joint of the demonstration object in the first item of demonstration behavioral data can be input into the neural network model to be trained, to obtain the force data output by the model for controlling each joint of the target object. In this application, the neural network model is trained through interaction between the model and the physics simulator; the model therefore needs to predict, from the input demonstration behavioral data, the forces on each joint that the target object needs in order to learn the corresponding demonstrated behavior. Because the target object has the same joints as the demonstration object, the force data output by the neural network model can be regarded as the force data for each joint of the target object (that is, the target object simulated in the physics simulator), and can also be regarded as the force data corresponding to each joint of the demonstration object.

The force data of a joint are data describing the force applied to the joint, for example one or more of the magnitude, direction and duration of the control force applied to the joint.

The neural network model can be chosen as needed; optionally, it may be a deep neural network model.
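As a rough illustration of a model that maps joint state information to force data, the sketch below uses a small two-layer network written with the standard library only. The class name `PolicyNetwork`, the layer sizes, the `tanh` activation, the omission of bias terms and the use of a single signed value per joint (sign for direction, size for magnitude) are all assumptions; the source leaves the model architecture open.

```python
import math
import random
from typing import List


class PolicyNetwork:
    """Tiny two-layer network: joint state vector in, one force value per joint out."""

    def __init__(self, state_dim: int, force_dim: int, hidden_dim: int = 16, seed: int = 0):
        rng = random.Random(seed)

        def init(rows: int, cols: int) -> List[List[float]]:
            scale = 1.0 / math.sqrt(cols)
            return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

        self.w1 = init(hidden_dim, state_dim)  # input layer weights
        self.w2 = init(force_dim, hidden_dim)  # output layer weights

    def forces(self, state: List[float]) -> List[float]:
        """Return the force data for each joint given the flattened joint states."""
        hidden = [math.tanh(sum(w * x for w, x in zip(row, state))) for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


# Two joints, each described by 6 state values (3 angles and 3 speeds).
policy = PolicyNetwork(state_dim=12, force_dim=2)
rest_forces = policy.forces([0.0] * 12)
moving_forces = policy.forces([0.1] * 12)
```

Training would adjust `w1` and `w2`; here they are only randomly initialized, so the outputs carry no meaning beyond showing the input and output shapes.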

S303: based on the force data of each joint determined by the neural network model, control the motion of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates a simulated behavioral data sequence based on the set action behavior limiting feature.

The simulated behavioral data sequence includes at least one item of simulated behavioral data, and each item includes second state information of each joint of the target object.

The state information of each joint of the target object simulated by the physics simulator likewise reflects the action behavior of the simulated target object, for example values such as the angle and speed of each joint. For ease of distinction, the state information of the joints of the simulated target object is called second state information.

It can be understood that, once the initial state information of each joint of the target object in the physics simulator has been determined, the force data of each joint output by the neural network model are input into the physics simulator. The physics simulator then simulates the force on each joint of the target object and how the motion of each joint changes under that force, and so obtains the state information of each joint of the simulated target object.

It can be understood that each time a force is applied once to each joint of the target object simulated in the physics simulator, the state information of each joint changes once, which yields one piece of simulated behavior data of the target object.

The physics simulator can also interact with the neural network model repeatedly to simulate multiple pieces of simulated behavior data. For example, depending on the number of pieces of demonstration behavior data in the demonstration behavior data segment, or on actual needs, multiple rounds of interaction between the physics simulator and the neural network model can be configured: the force data output by the neural network model is updated from the simulated behavior data produced by the physics simulator, the updated force data is applied again to the target object simulated in the physics simulator, and the process is repeated. This yields a series of simulated behavior data, and thus a simulated behavior data sequence containing at least one piece of simulated behavior data.

In particular, an action behavior constraint feature is also configured in the physics simulator of this application. The action behavior constraint feature defines the features that the action behavior of the simulated target object must satisfy. That is, the physics simulator is configured with additional action behavior requirements that the target object must meet when learning the action behavior.

For example, the action behavior constraint feature can specify that the simulated target object must carry a set article while performing the action behavior, e.g., the target object needs to carry a box.

For another example, the action behavior constraint feature can restrict the manner of the action behavior of the simulated target object, e.g., the target object needs to change its movement continuously.

For another example, the action behavior constraint feature can specify that the target object must learn the action behavior while controlling the motion of a specific article.

It can be understood that, when an action behavior constraint feature is configured in the physics simulator, the action behavior of the target object ultimately simulated through the interaction between the neural network model and the physics simulator must satisfy the following principle: the action behavior of the simulated target object satisfies the action behavior constraint feature, on the premise that it is as similar as possible to the action behavior corresponding to the demonstration behavior data of the demonstration object.

For example:

Assume that the action behavior demonstrated by the demonstration object is a walking motion, and that the goal of demonstration learning is for the target object to learn the action behavior of walking while carrying an article. In that case, the action behavior constraint feature configured in the physics simulator can be that the target object carries an article.

S304, determine the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulated behavior data.

The action behavior difference degree reflects the overall difference between the first state information of each joint of the demonstration object and the second state information of each joint of the simulated target object. This overall difference is in effect the difference between the action behavior of the simulated target object and the action behavior of the demonstration object.

The specific way of determining the action behavior difference degree can be set as needed. For example, each piece of simulated behavior data produced by the physics simulator is the result of learning the first state information of each joint in the corresponding piece of demonstration behavior data in the demonstration behavior data segment. Corresponding pairs of simulated behavior data and demonstration behavior data can therefore be determined from the order of the simulated behavior data in the simulated behavior data sequence and the order of the demonstration behavior data in the demonstration behavior data segment. For each such pair, a state difference value can be calculated for each joint from the second state value of that joint of the target object in the simulated behavior data and the first state value of that joint of the demonstration object in the demonstration behavior data, for example as the Euclidean distance between the state information of the joint in the target object and in the demonstration object. The action behavior difference degree can then be determined from the average of all the state difference values.
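As a minimal sketch of the averaging scheme just described, the following Python computes the mean per-joint Euclidean distance over frames paired by sequence order. The frame layout (a list of per-joint state tuples such as angle and velocity) and the function names are assumptions made for illustration; the source does not prescribe a data format.

```python
import math


def joint_distance(sim_state, demo_state):
    """Euclidean distance between two per-joint state vectors, e.g. (angle, velocity)."""
    return math.sqrt(sum((s - d) ** 2 for s, d in zip(sim_state, demo_state)))


def behavior_difference(sim_frames, demo_frames):
    """Mean per-joint Euclidean distance over frames paired by sequence order.

    Each frame is assumed to be a list of per-joint state tuples (hypothetical layout).
    """
    if len(sim_frames) != len(demo_frames):
        raise ValueError("frames must be paired one-to-one")
    distances = [
        joint_distance(sim_joint, demo_joint)
        for sim_frame, demo_frame in zip(sim_frames, demo_frames)
        for sim_joint, demo_joint in zip(sim_frame, demo_frame)
    ]
    if not distances:
        raise ValueError("no joint states to compare")
    return sum(distances) / len(distances)
```

With one frame of two joints, where the first joint differs by (3, 4) and the second matches exactly, the per-joint distances are 5 and 0, so the difference degree is 2.5.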

S305, based on the action behavior difference degree, optimize the behavior control strategy expressed by the neural network model until the optimization target is reached, and determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

It can be understood that the action behavior difference degree reflects how much the action behavior of the simulated target object differs from the action behavior demonstrated by the demonstration object. The action behavior difference degree can therefore serve as the parameter on which optimization of the neural network model is based.

Optimizing the behavior control strategy expressed by the neural network model essentially means adjusting the internal parameters of the neural network model so as to change the behavior control strategy that it expresses.

Optionally, demonstration learning can be combined with a reinforcement learning algorithm. Correspondingly, a reward signal can be determined according to the action behavior difference degree in combination with the reinforcement learning algorithm, and the internal parameters of the neural network model are adjusted according to the reward signal.

It can be understood that the optimization target can be set as needed. Reaching the optimization target indicates that, for the demonstration behavior data, the similarity between the action behavior of the target object simulated in the physics simulator under the behavior control strategy output by the neural network model and the demonstration behavior of the demonstration object meets the requirement. For example, in an optional implementation, the optimization target can be that the action behavior difference degree reaches a minimum, that is, the current action behavior difference degree is smaller than the action behavior difference degrees determined before the current moment. The optimization target can also be that the change in the determined action behavior difference degree is smaller than a set value.
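The two optional optimization targets above can be sketched as two separate predicates over the history of difference degrees. This is an illustrative sketch only: the `history` list (oldest value first, newest last), the function names, and the default threshold are assumptions, not values from the source.

```python
def is_minimum_so_far(history):
    """True if the newest difference degree is smaller than every earlier one."""
    if len(history) < 2:
        return False
    return history[-1] < min(history[:-1])


def change_below_threshold(history, threshold=1e-3):
    """True if the newest difference degree changed by less than `threshold`."""
    if len(history) < 2:
        return False
    return abs(history[-1] - history[-2]) < threshold
```

Which predicate is used, and what threshold counts as a small enough change, is a design choice left open by the text.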

If it is determined from the current action behavior difference degree that the optimization target has not yet been reached, the behavior control strategy expressed by the neural network model needs to be optimized based on the action behavior difference degree, and the neural network model needs to be trained further with sampled training samples. For example, if multiple demonstration behavior data segments were sampled in step S301, a demonstration behavior data segment not yet used for training can be selected and steps S302 to S305 executed again. Optionally, if only one demonstration behavior data segment is sampled from the demonstration behavior data sequence as a training sample each time, the method can return to step S301, sample another demonstration behavior data segment from the demonstration behavior data sequence as a training sample, and execute steps S302 to S305 again, until the optimization target is reached.

Correspondingly, if it is determined that the optimization target has been reached, learning (in other words, training) can end, and the trained neural network model is used as the behavior control strategy of the target object in a real scenario.

Optionally, after the neural network model is obtained through training, it can also be loaded into a target application, so that the action behavior of the target object controlled by the target application is controlled through the behavior control strategy expressed by the neural network model. The target application is used to control the operation of the target object; that is, the target application is the control program of the target object in the actual application scenario, not the control program of the target object simulated in the simulation environment.

For example, taking demonstration learning in the game field, after the neural network model is obtained through training, it can be loaded into a game application so that the action behavior of a game object in the game application is controlled based on the neural network model. For instance, the current action behavior of the game object is input into the neural network model, and the motion of each joint of the game object is controlled based on the force data of each joint output by the neural network model, so that the game object produces an action behavior that is similar to that of the demonstration object and satisfies the action behavior constraint feature.

As can be seen from the above technical solution, in this application the behavior control strategy required for demonstration learning is expressed by a neural network model, and the behavior control strategy expressed by the neural network model is trained through the cooperation of the neural network model and the physics simulator. Moreover, during training of the neural network model, in addition to using the demonstration behavior data, an action behavior constraint feature corresponding to the object that is to learn the behavior skill is configured in the physics simulator. The action behavior constraint feature defines the requirements that the behavior of the target object simulated in the physics simulator must satisfy, so that the behavior control strategy expressed by the trained neural network model enables the target object to generate other action behaviors that are as similar as possible to the demonstration behavior data while also satisfying the configured action behavior constraint feature.

It follows that, when the motion of the target object is controlled based on the trained neural network model, the target object can learn not only action behaviors similar to the demonstration behavior data but also action behaviors that are not exactly the same as the action behavior corresponding to the demonstration behavior data, that is, other similar action behaviors obtained by extension. The target object can thus learn, from the demonstration behavior data, action behaviors that differ from the demonstrated behavior. Consequently, even without demonstration behavior data for a particular action behavior, a behavior control strategy for that action behavior can still be obtained, and the target object can be controlled based on that behavior control strategy to learn an action behavior that is similar to but different from the demonstration behavior, which helps reduce the complexity of demonstration learning.

For ease of understanding, the scheme of this application is described below taking as an example a process in which the neural network model is trained through deep reinforcement learning. In this case, deep reinforcement learning is combined with demonstration learning, and the action behavior constraint feature is set according to the specific task requirements of the target object that is to learn the behavior skill, so that training yields a model with which the target object learns an action behavior that follows the action form of the demonstration object and satisfies the specific requirements.

Fig. 4 shows a schematic flowchart of another embodiment of the learning method for a behavior control strategy of this application. This embodiment is likewise applicable to the above computer device, and the method of this embodiment may include:

S401, randomly sample one demonstration behavior data segment from the obtained demonstration behavior data sequence.

For example, the demonstration behavior data within a continuous time period is randomly selected as the demonstration behavior data segment, which includes the demonstration behavior data of the demonstration object at at least two consecutive moments. Each piece of demonstration behavior data likewise includes the first state value of each joint of the demonstration object.
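A minimal sketch of this sampling step, assuming the demonstration behavior data sequence is an in-memory Python list ordered by time. The function name, the fixed segment length, and the optional `rng` parameter are hypothetical and introduced only for illustration.

```python
import random


def sample_segment(sequence, length, rng=random):
    """Return `length` consecutive frames starting at a uniformly random index."""
    if length < 2:
        raise ValueError("a segment needs at least two consecutive frames")
    if length > len(sequence):
        raise ValueError("segment is longer than the sequence")
    start = rng.randrange(len(sequence) - length + 1)
    return sequence[start:start + length]
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which can help when debugging training runs.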

It can be understood that step S401 is described taking the sampling of one demonstration behavior data segment as a training sample as an example; other sampling cases are also applicable to this embodiment.

S402, set the initial state information of each joint of the target object simulated in the physics simulator according to the first state information of each joint of the demonstration object in the first piece of demonstration behavior data in the demonstration behavior data segment.

For example, the initial state information of each joint of the target object in the physics simulator is set to be consistent with the first state information of the corresponding joint of the demonstration object in the first piece of demonstration behavior data. This sets the initial state of the target object in the physics simulator, so that the physics simulator can subsequently simulate the target object learning the demonstrated actions corresponding to the second and subsequent pieces of demonstration behavior data in the segment.

S403, input the first state information of each joint of the demonstration object in the first piece of demonstration behavior data into the neural network model to be trained, to obtain the force data, output by the neural network model, for controlling each joint of the target object.

S404, based on the initial state information of each joint of the target object simulated in the physics simulator, apply a force to each joint of the target object simulated in the physics simulator according to the force data of each joint of the target object determined by the neural network model, so that the physics simulator simulates one piece of simulated behavior data of the target object based on the configured action behavior constraint feature.

It can be understood that, once the initial state information of each joint of the target object in the physics simulator has been determined, applying a force to each joint of the target object changes the state of each joint once and yields one piece of simulated behavior data, which includes the second state information of each joint of the target object.

It can be understood that the simulated behavior data produced in step S404 is the state information of each joint of the target object simulated from the initial state of each joint under the force output by the neural network model. This simulated behavior data therefore represents the action behavior that the target object in the physics simulator produces by learning the second piece of demonstration behavior data in the demonstration behavior data segment.

S405, detect whether the total number of pieces of simulated behavior data satisfies a set condition; if so, confirm that a simulated behavior data sequence containing at least one piece of simulated behavior data has been obtained, and execute step S408; if not, execute step S406.

The set condition can be set as needed. For example, assuming that demonstration learning is to be performed for a set number of pieces of demonstration behavior data in the demonstration behavior data segment, the set condition can be that the total number reaches the set number.

Optionally, the physics simulator can be required to simulate the target object learning the demonstrated actions corresponding to all the demonstration behavior data in the demonstration behavior data segment. The set condition can then be that the total number of pieces of simulated behavior data is consistent with the number of pieces of demonstration behavior data in the segment, or that the total number of pieces of simulated behavior data exceeds the number of pieces of demonstration behavior data in the segment. It should be noted that, if the initial state information of each joint of the target object in the physics simulator is also counted as one piece of simulated behavior data produced by the physics simulator, the set condition can be that the total number of pieces of simulated behavior data equals the number of pieces of demonstration behavior data in the segment. If the initial state information is not counted as a piece of simulated behavior data, the total number of pieces of simulated behavior data only needs to equal the number of pieces of demonstration behavior data in the segment minus 1.
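The counting rule in the preceding paragraph can be sketched as follows. The function names and the boolean flag `count_initial_state` are hypothetical; the `>=` comparison is chosen here so that both the "consistent with" and the "exceeds" variants of the set condition are accepted.

```python
def required_sim_count(demo_count, count_initial_state):
    """Number of simulated frames needed to cover a segment of `demo_count` frames."""
    return demo_count if count_initial_state else demo_count - 1


def condition_met(sim_count, demo_count, count_initial_state):
    """True once enough simulated frames have been produced (step S405)."""
    return sim_count >= required_sim_count(demo_count, count_initial_state)
```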

It can be understood that, if the total number of pieces of simulated behavior data satisfies the set condition, the at least one piece of simulated behavior data that has been produced is determined as the simulated behavior data sequence. If the initial state information of each joint of the target object in the physics simulator is also counted as one piece of simulated behavior data produced by the physics simulator, the simulated behavior data sequence includes at least two pieces of simulated behavior data.

S406, input the simulated behavior data of the target object most recently produced by the physics simulator into the neural network model, to obtain updated force data of each joint of the target object.

S407, according to the updated force data of each joint of the target object, apply a force to each joint of the target object simulated in the physics simulator, so that the physics simulator simulates simulated behavior data of the target object based on the configured action behavior constraint feature, and return to step S405, until the total number of pieces of simulated behavior data satisfies the set condition.

In steps S406 and S407, the neural network model updates, based on the simulated behavior data produced by the physics simulator, the force data to be applied to each joint of the target object, and the physics simulator is controlled to continue simulating the motion of each joint of the target object until multiple pieces of simulated behavior data are obtained.

For example, assume that the demonstration behavior data segment includes 5 consecutive pieces of demonstration behavior data. The initial state information of each joint of the target object in the physics simulator is set based on the first piece of demonstration behavior data in the segment, which gives the physics simulator the first piece of simulated behavior data. The physics simulator can then produce, through step S404, the second piece of simulated behavior data, corresponding to the second piece of demonstration behavior data. By repeating steps S406 and S407 three times, the third to fifth pieces of simulated behavior data, corresponding to the third to fifth pieces of demonstration behavior data in the segment, are obtained, giving a simulated behavior data sequence containing five pieces of simulated behavior data.
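The five-frame example can be traced with a toy rollout loop. This is a sketch under stated assumptions: `policy` and `step` are placeholder callables standing in for the neural network model and the physics simulator, the state is one scalar angle per joint, and the initial state is counted as the first simulated frame. None of these names or simplifications come from the source.

```python
def rollout(demo_segment, policy, step):
    """Steps S402 to S407 for one segment, counting the initial state as frame one.

    policy(state) -> per-joint forces; step(state, forces) -> next state.
    Both callables are hypothetical stand-ins for the network and the simulator.
    """
    state = list(demo_segment[0])                # S402: start from the first demo frame
    sim_frames = [state]
    while len(sim_frames) < len(demo_segment):   # S405: set condition on frame count
        forces = policy(state)                   # S403 on the first pass, S406 afterwards
        state = step(state, forces)              # S404 on the first pass, S407 afterwards
        sim_frames.append(state)
    return sim_frames


def toy_policy(state):
    """Toy policy: push every joint with the same constant force."""
    return [0.1 for _ in state]


def toy_step(state, forces):
    """Toy dynamics: the force is added directly to each joint angle."""
    return [angle + force for angle, force in zip(state, forces)]


demo = [[0.0, 1.0]] * 5
frames = rollout(demo, toy_policy, toy_step)
```

For a five-frame segment the policy is queried four times: once for the S403/S404 step and three times for the repeated S406/S407 steps, matching the count in the example above.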

S408, determine the action behavior difference degree between the simulated target object and the demonstration object according to the at least two pieces of demonstration behavior data in the demonstration behavior data segment and the at least one piece of simulated behavior data in the simulated behavior data sequence.

It can be understood that, because the first piece of demonstration behavior data in the segment is consistent with the initial state information of each joint of the target object in the physics simulator, it suffices to compare the demonstration behavior data after the first piece in the segment with the simulated behavior data produced by the physics simulator after the initial state information of each joint.

Of course, if the initial state information of each joint of the target object in the physics simulator is also counted as one piece of simulated behavior data produced by the physics simulator, the physics simulator outputs at least two pieces of simulated behavior data, and the demonstration behavior data and simulated behavior data at corresponding positions can be compared one by one in sequence order.

S409, detect, according to the action behavior difference degree and the action behavior difference degrees determined before the current moment, whether the action behavior difference degree has reached a convergence state; if not, execute step S410; if so, end training.

The convergence state can be understood as a convergence state conventionally set in reinforcement learning, such as the optimization targets described above, which are not repeated here.

S410, determine a reward signal according to the action behavior difference degree.

It can be understood that reinforcement learning trains an agent using a high-fidelity physics engine and reinforcement signals. During training, the agent interacts continuously with the physics engine using its existing policy, generating a series of reinforcement signals (that is, reward signals), which are used to update the policy. In this embodiment, the policy is expressed by the neural network model, and the agent is the target object simulated in the physics engine. Therefore, a reward signal for updating the policy in the neural network model can be determined according to the action behavior difference degree.

The larger the action behavior difference degree, the smaller the reward signal; conversely, the smaller the action behavior difference degree, the larger the reward signal.
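The text only requires that the reward signal decrease as the difference degree grows; it does not fix a formula. One possible mapping with that property, shown purely as an assumption, is a negative exponential, which also keeps the signal in the range (0, 1]. The function name and the `scale` parameter are hypothetical.

```python
import math


def reward_from_difference(difference, scale=1.0):
    """Map a non-negative difference degree to a reward that shrinks as it grows.

    The exponential form and `scale` are illustrative choices, not from the source.
    """
    if difference < 0:
        raise ValueError("difference degree must be non-negative")
    return math.exp(-scale * difference)
```

Any other strictly decreasing function, such as the negated difference degree itself, would satisfy the stated relationship equally well.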

S411, adjust the internal parameters of the neural network model according to the reward signal, so as to change the behavior control strategy expressed by the neural network model, and return to step S401 to resample a demonstration behavior data segment.

It can be understood that the goal of continually optimizing the neural network model is for the simulated target object to generate action behavior as similar as possible to the demonstration data. This optimization problem can be expressed as follows:

$$\min_{\tau} \lvert \tau - \tau_E \rvert \quad \text{subject to} \quad h(\tau) \le 0,\quad g(\tau) = 0$$

Here, τ_E is the demonstration behavior data, and τ is the simulated behavior data of the simulated target object obtained by the final optimization; the simulated behavior data includes the second state information of each joint of the simulated target object. h(τ) ≤ 0 and g(τ) = 0 represent two ways of setting different action behavior constraint features. For example, h(τ) ≤ 0 can describe an action feature that can be used only when a certain situation does not apply, and g(τ) = 0 can describe an action behavior feature that can be executed only when a certain situation holds exactly.

It follows that the essence of the optimization problem is to generate optimized data, namely τ, that satisfies the action behavior constraint feature and is as similar as possible to the demonstration behavior data.
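To make the roles of the objective and the two constraint forms concrete, the sketch below evaluates them for a trajectory flattened into a list of numbers. Everything specific here is an assumption for illustration: the sum of absolute differences used for |τ - τ_E|, the tolerance on the equality constraint, and the two toy constraint functions.

```python
def objective(tau, tau_e):
    """Distance between simulated and demonstrated data (L1 norm assumed)."""
    return sum(abs(a - b) for a, b in zip(tau, tau_e))


def satisfies_constraints(tau, h, g, tol=1e-6):
    """True if h(tau) <= 0 and g(tau) == 0 within a numerical tolerance."""
    return h(tau) <= 0.0 and abs(g(tau)) <= tol


def toy_h(tau):
    """Hypothetical inequality constraint: no value may exceed 1.0."""
    return max(tau) - 1.0


def toy_g(tau):
    """Hypothetical equality constraint: the first value is pinned to 0.0."""
    return tau[0]
```

A candidate τ is acceptable only if `satisfies_constraints` holds; among acceptable candidates, the one with the smallest `objective` value is preferred.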

Correspondingly, a reward function is defined with the generated optimized τ as the learning target. After a large number of simulations in the physics simulator, the resulting reward signals can be used to update the neural network model that expresses the behavior control strategy.

For an intuitive understanding of the learning method for a behavior control strategy of this application, refer to Fig. 5, which shows a schematic block diagram of the implementation principle of the method of this application.

As can be seen from Fig. 5, after demonstration behavior data is sampled from the demonstration behavior data sequence, it can be input into the neural network model. Based on the demonstration behavior data, the neural network model outputs force data for controlling each joint of the target object simulated in the physics simulator, so that the physics simulator can simulate the behavior of the target object based on the action behavior constraint feature and output the simulated behavior data of the simulated target object. The simulated behavior data includes the state information of each joint of the simulated target object. By comparing the simulated behavior data with the sampled demonstration behavior data, the behavior difference degree between the demonstration object and the target object can be determined. The neural network model can then be optimized based on the behavior difference degree until convergence, so that the simulated behavior data output by the physics simulator approaches the corresponding demonstration behavior data and the action behavior represented by the simulated behavior data satisfies the action behavior constraint feature.

To facilitate understanding of the benefits of the scheme of this application, an application scenario is introduced below.

Take demonstration learning for a game character in a game application as an example, and assume that the game character needs to generate the behavior of walking while carrying an article based on the walking motion demonstrated by a real user. In that case, the learning method for a behavior control strategy of this embodiment is shown in Fig. 6. The method can be applied to a computer device and may include:

S601, obtain the demonstration data sequence of the walking motion demonstrated by the real user.

In this embodiment, demonstration learning is used so that the game character in the game application learns the behavior of a real user; the demonstration data sequence is therefore data of the walking motion demonstrated by the real user. Specifically, the demonstration data sequence includes the first state values of each joint of the real user at multiple different moments.

It can be understood that this embodiment takes as an example the case where the game character needs to learn the walking motion of a real user. If the game character is to learn another motion, it is only necessary to obtain the demonstration data sequence of the corresponding motion demonstrated by a real user or by a demonstration object having the same joints as the game character. For example, if the game character needs to learn a somersault, the demonstration data sequence only needs to be replaced with the demonstration data sequence of a somersault demonstrated by a demonstrator such as a real user.

S602, randomly sample one demonstration data segment from the demonstration data sequence.

S603, set the initial state information of each joint of the game character in the physics simulator according to the first piece of demonstration data in the demonstration data segment, to obtain the first piece of simulated behavior data of the game character in the physics simulator.

Steps S602 and S603 again take one way of sampling training samples as an example; other sampling methods are equally applicable to this embodiment.

S604, input the first piece of demonstration data into the neural network model to be trained, to obtain the force data, output by the neural network model, of each joint of the game character to be simulated.

S605, according to the force data of each joint of the game character output by the neural network model, control the motion of each joint of the game character simulated in the physics simulator, so that the physics simulator simulates, based on the configured feature of walking while carrying an article, the second state information of each joint of the game character while it walks carrying the article, thereby obtaining the second piece of simulated behavior data of the simulated game character.

It can be understood that, because the game character needs to learn, by extension from the walking motion demonstrated by the real user, the motion of walking while carrying an article, the action behavior constraint feature configured in the physics simulator is the feature that the target object walks while carrying an article (such as a box). Correspondingly, the physics simulator can simulate the process of the game character walking while carrying the article according to the force data of each joint input from the neural network model, and output the resulting simulated behavior data of the game character walking while carrying the article. The simulated behavior data includes the second state information of each joint of the game character.

It can be understood that, if other action behaviors related to walking need to be extended from the walking motion of the real user, the action behavior constraint feature configured in the physics simulator will differ accordingly. For example, if the game character needs to learn, from the normal walking motion of the real user, the motor skill of walking while constantly changing posture, the action behavior constraint feature configured in the physics simulator can be that the walking postures of the game character at adjacent moments are different. Of course, this embodiment takes the scenario of learning a walking motion as an example; if the action behavior to be learned is different, the action behavior constraint feature in the physics simulator can be set according to the action behavior demonstrated by the demonstration object and the specific action behavior that the game character needs to extend.
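The "postures at adjacent moments are different" constraint could be checked along the following lines. This is a hedged sketch: each posture is assumed to be a flat list of joint angles, and the function name and the `min_distance` threshold are hypothetical values chosen for illustration.

```python
import math


def postures_differ(frames, min_distance=0.05):
    """True if every pair of consecutive postures differs by at least `min_distance`.

    Each frame is assumed to be a flat list of joint angles (hypothetical layout).
    """
    for previous, current in zip(frames, frames[1:]):
        gap = math.sqrt(sum((c - p) ** 2 for p, c in zip(previous, current)))
        if gap < min_distance:
            return False
    return True
```

How a real physics simulator would enforce such a feature during simulation is not specified in the text; the predicate above only expresses what the constraint means for a finished sequence of postures.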

S606, input the simulated behavior data of the game character most recently produced by the physics simulator into the neural network model to obtain updated force data of each joint of the game character, and, according to the updated force data of each joint of the game character, apply a force to each joint of the game character simulated in the physics simulator, so that the physics simulator simulates simulated behavior data of the game character based on the configured action behavior constraint feature. Repeat step S606 until the total number of pieces of simulated behavior data is consistent with the total number of pieces of demonstration behavior data in the demonstration behavior data segment.

For step S606, refer to the related description of the preceding embodiment; details are not repeated here.

S607, determine the action behavior difference degree between the simulated game character and the real user according to each piece of demonstration behavior data in the demonstration behavior data segment and each piece of simulated behavior data in the simulated behavior data sequence.

S608, detect, according to the action behavior difference degree and the action behavior difference degrees determined before the current moment, whether the currently determined action behavior difference degree has reached the minimum; if not, execute step S609; if so, end training.

This embodiment takes as an example an optimization goal in which the action behavior difference degree reaches its minimum; other optimization goals are equally applicable to this embodiment.
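How "reaching the minimum" is detected from the current and earlier difference degrees is not detailed in the source. A minimal sketch of one plausible criterion is given below: training is considered converged when none of the most recent values improves on the earlier best by more than a small margin. The function name, `window`, and `eps` are all assumptions.

```python
from typing import Sequence


def reached_minimum(history: Sequence[float], current: float,
                    window: int = 3, eps: float = 1e-4) -> bool:
    """Assumed stopping test: the last `window` difference degrees did not
    improve on the earlier best value by more than `eps`."""
    values = list(history) + [current]
    if len(values) < window + 1:
        return False  # not enough evidence yet
    best_before = min(values[:-window])
    recent = values[-window:]
    return all(v > best_before - eps for v in recent)
```

With this rule, a steadily decreasing sequence keeps training, while a sequence that has flattened out ends it.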

S609, determine an excitation signal according to the action behavior difference degree.

S610, adjust the internal parameters of the neural network model according to the excitation signal, so as to change the behavior control strategy expressed by the neural network model, and return to step S602 to resample a demonstration behavior data segment, the resampled segment serving as the training sample.
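The overall control flow from sampling through S610 can be outlined as below. This is a structural sketch only: every callable passed to `train` is a hypothetical placeholder for a component described in the text (segment sampling, simulation, difference computation, goal detection, excitation signal, parameter adjustment), and `max_iterations` is an added safety bound that the source does not mention.

```python
from typing import Callable, List, Sequence

Segment = Sequence[Sequence[float]]


def train(sample_segment: Callable[[], Segment],
          simulate: Callable[[Segment], Segment],
          difference: Callable[[Segment, Segment], float],
          reached_goal: Callable[[List[float], float], bool],
          excitation_fn: Callable[[float], float],
          adjust_parameters: Callable[[float], None],
          max_iterations: int = 1000) -> List[float]:
    """Run the sample / simulate / compare / adjust loop and return the
    history of action behavior difference degrees."""
    history: List[float] = []
    for _ in range(max_iterations):
        segment = sample_segment()             # S602: sample a training segment
        simulated = simulate(segment)          # simulation up to and including S606
        diff = difference(segment, simulated)  # S607: difference degree
        done = reached_goal(history, diff)     # S608: optimization goal reached?
        history.append(diff)
        if done:
            break
        # S609 and S610: derive the excitation signal, then adjust the model.
        adjust_parameters(excitation_fn(diff))
    return history
```

Passing the components as callables keeps the loop independent of any particular simulator or reinforcement learning algorithm, which matches the level of detail given in the text.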

It can be understood that once it is confirmed that the optimization goal has been reached, training of the neural network model is complete. On this basis, the action behavior of the game character in the game application can be controlled based on the control strategy expressed by the neural network model, so that the game character learns the action behavior of walking while carrying an article.

Specifically, the trained neural network model can be loaded into a game application in which the game character carries an article. In this case, the game application can obtain the state information of each joint of the game character and input it into the neural network model. The game application can then control the movement of each joint of the game character based on the force on each joint output by the neural network model, so that the game character produces the motion of walking while carrying the article.
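The use of the trained model inside the game application can be sketched as a per-frame control step. The class `CharacterController`, the dictionary layout keyed by joint name, and the fixed alphabetical joint ordering are assumptions for illustration; the source only states that joint states go into the model and per-joint forces come out.

```python
from typing import Callable, Dict, List


class CharacterController:
    """Hypothetical wrapper that maps joint states to joint forces
    through a trained policy (the neural network model)."""

    def __init__(self, policy: Callable[[List[float]], List[float]]) -> None:
        self.policy = policy

    def step(self, joint_states: Dict[str, float]) -> Dict[str, float]:
        # Use a fixed joint order so inputs and outputs line up.
        names = sorted(joint_states)
        forces = self.policy([joint_states[name] for name in names])
        if len(forces) != len(names):
            raise ValueError("policy must return one force per joint")
        return dict(zip(names, forces))
```

The game application would call `step` once per frame and hand the returned forces to its own physics engine.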

As can be seen from this embodiment, the scheme of this application can train, based on the walking motion of a real user, the neural network model corresponding to the behavior control strategy needed to control a game character walking while carrying an article. Action control can then be applied to the game character in the game application based on the neural network model, so that the game character learns a motion skill of walking while carrying an article that is similar to the walking motion demonstrated by the real user and is extended on the basis of that demonstrated walking motion.

Tests show that the scheme of this application enables the game character to acquire an optimal article-carrying walking behavior and to walk stably for a long time, achieving an effect that existing schemes cannot achieve.

Corresponding to the learning method for a behavior control strategy of this application, this application also provides a learning device for a behavior control strategy.

As shown in FIG. 7, which is a schematic structural diagram of one embodiment of a learning device for a behavior control strategy of this application, the device of this embodiment may include:

a data sampling unit 701, configured to sample, from a demonstration behavior data sequence, a demonstration behavior data segment as a training sample, the demonstration behavior data segment including at least two pieces of demonstration behavior data in sequential order, the demonstration behavior data including first state information of each joint of a demonstration object;

a model control unit 702, configured to set, according to the demonstration behavior data segment, initial state information of each joint of a target object simulated in a physics simulator, and to determine, using a neural network model to be trained, force data acting on each joint of the target object, the target object having the same joints as the demonstration object;

a data simulation unit 703, configured to control, based on the force data of each joint of the target object determined by the neural network model, the movement of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates, based on a set action behavior limiting feature, a simulated behavior data sequence of the target object, the simulated behavior data sequence including at least one piece of simulated behavior data in sequential order, the simulated behavior data including second state information of each joint of the target object, and the action behavior limiting feature being used to define a feature that the action behavior of the simulated target object is required to satisfy;

a difference comparison unit 704, configured to determine the action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulated behavior data;

a training optimization unit 705, configured to optimize, based on the action behavior difference degree, the behavior control strategy expressed by the neural network model until an optimization goal is reached, and to determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

In one possible implementation, the training optimization unit includes:

a detection subunit, configured to detect whether the action behavior difference degree reaches the set optimization goal;

a loop training subunit, configured to, if the action behavior difference degree does not reach the set optimization goal, optimize the behavior control strategy expressed by the neural network model based on the action behavior difference degree, and return to execute the operation of the data sampling unit;

an end control subunit, configured to, if the action behavior difference degree reaches the set optimization goal, confirm that learning is complete and determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

Optionally, when optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree, the training optimization unit or the loop training subunit is specifically configured to: determine an excitation signal according to the action behavior difference degree and based on a reinforcement learning algorithm; and adjust the internal parameters of the neural network model according to the excitation signal, so as to change the behavior control strategy expressed by the neural network model.
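The source names neither the reinforcement learning algorithm nor the mapping from difference degree to excitation signal. The sketch below therefore substitutes two common choices as assumptions: an exponential mapping in which a smaller difference yields a larger excitation signal, and a single REINFORCE-style gradient step for a Gaussian policy whose mean is linear in the joint state. All names and hyperparameters (`scale`, `sigma`, `lr`) are hypothetical.

```python
import math
from typing import List, Sequence


def excitation_from_difference(difference: float, scale: float = 1.0) -> float:
    """Assumed mapping: a smaller difference gives a larger excitation signal."""
    return math.exp(-scale * difference)


def reinforce_update(weights: Sequence[float], state: Sequence[float],
                     action: float, excitation: float,
                     sigma: float = 1.0, lr: float = 0.01) -> List[float]:
    """One REINFORCE step for a Gaussian policy a ~ N(w . s, sigma^2).

    The gradient of log pi(a | s) with respect to w is
    (a - w . s) / sigma^2 * s, scaled here by the excitation signal.
    """
    mean = sum(w * s for w, s in zip(weights, state))
    coeff = lr * excitation * (action - mean) / sigma ** 2
    return [w + coeff * s for w, s in zip(weights, state)]
```

In a real system the linear policy would be replaced by the neural network and the gradient would be computed by an automatic differentiation library; the update rule shown is only meant to make "adjusting internal parameters according to the excitation signal" concrete.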

In one possible implementation, the model control unit includes:

a simulation initialization unit, configured to set the initial state information of each joint of the target object simulated in the physics simulator according to the first state information of each joint of the demonstration object in the first piece of demonstration behavior data in the demonstration behavior data segment;

an initial force determination unit, configured to input the first state information of each joint of the demonstration object in the first piece of demonstration behavior data into the neural network model to be trained, and obtain the force data, output by the neural network model, for controlling each joint of the target object.

In another possible implementation, the data simulation unit includes:

a simulation control unit, configured to, based on the initial state information of each joint of the target object simulated in the physics simulator, apply forces to each joint of the target object simulated in the physics simulator according to the force data of each joint of the target object determined by the neural network model, so that the physics simulator simulates, based on the set action behavior limiting feature, one piece of simulated behavior data of the target object;

a simulation end control unit, configured to, if the total number of pieces of simulated behavior data meets a set condition, confirm that a simulated behavior data sequence including at least one piece of simulated behavior data has been obtained;

a simulation loop unit, configured to, if the total number of pieces of simulated behavior data does not meet the set condition, input the simulated behavior data of the target object most recently simulated by the physics simulator into the neural network model to obtain updated force data for each joint of the target object, and apply forces to each joint of the target object simulated in the physics simulator according to the updated force data, so that the physics simulator simulates, based on the set action behavior limiting feature, simulated behavior data of the target object, until the total number of pieces of simulated behavior data meets the set condition.

Optionally, the device may further include:

a model application unit, configured to, after the training optimization unit obtains the behavior control strategy expressed by the neural network model, load the neural network model into a target application, so as to control, through the behavior control strategy expressed by the neural network model, the action behavior of the target object controlled by the target application, the target application being used to control the operation of the target object.

In another aspect, this application also provides a storage medium storing computer-executable instructions. When the computer-executable instructions are loaded and executed by a processor, the learning method for a behavior control strategy in any one of the above embodiments is implemented.

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.

Finally, it should also be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A learning method for a behavior control strategy, characterized by comprising:

sampling, from a demonstration behavior data sequence, a demonstration behavior data segment as a training sample, the demonstration behavior data segment comprising at least two pieces of demonstration behavior data in sequential order, and the demonstration behavior data comprising first state information of each joint of a demonstration object;

setting, according to the demonstration behavior data segment, initial state information of each joint of a target object simulated in a physics simulator, and determining, using a neural network model to be trained, force data acting on each joint of the target object, the target object having the same joints as the demonstration object;

controlling, based on the force data of each joint of the target object determined by the neural network model, the movement of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates, based on a set action behavior limiting feature, a simulated behavior data sequence of the target object, the simulated behavior data sequence comprising at least one piece of simulated behavior data in sequential order, the simulated behavior data comprising second state information of each joint of the target object, and the action behavior limiting feature being used to define a feature that the action behavior of the simulated target object is required to satisfy;

determining an action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulated behavior data;

optimizing, based on the action behavior difference degree, the behavior control strategy expressed by the neural network model until an optimization goal is reached, and determining the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

2. The learning method for a behavior control strategy according to claim 1, characterized in that the optimizing, based on the action behavior difference degree, the behavior control strategy expressed by the neural network model until an optimization goal is reached comprises:

detecting whether the action behavior difference degree reaches the set optimization goal;

if the action behavior difference degree does not reach the set optimization goal, optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree, and returning to execute the operation of sampling, from the demonstration behavior data sequence, a demonstration behavior data segment as a training sample;

if the action behavior difference degree reaches the set optimization goal, confirming that learning is complete.

3. The learning method for a behavior control strategy according to claim 1 or 2, characterized in that the optimizing the behavior control strategy expressed by the neural network model based on the action behavior difference degree comprises:

determining an excitation signal according to the action behavior difference degree and based on a reinforcement learning algorithm;

adjusting the internal parameters of the neural network model according to the excitation signal, so as to change the behavior control strategy expressed by the neural network model.

4. The learning method for a behavior control strategy according to claim 1, characterized in that the setting, according to the demonstration behavior data segment, initial state information of each joint of a target object simulated in a physics simulator, and determining, using a neural network model to be trained, force data acting on each joint of the target object comprises:

setting the initial state information of each joint of the target object simulated in the physics simulator according to the first state information of each joint of the demonstration object in the first piece of demonstration behavior data in the demonstration behavior data segment;

inputting the first state information of each joint of the demonstration object in the first piece of demonstration behavior data into the neural network model to be trained, and obtaining the force data, output by the neural network model, for controlling each joint of the target object.

5. The learning method for a behavior control strategy according to claim 1 or 4, characterized in that the controlling, based on the force data of each joint of the target object determined by the neural network model, the movement of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates, based on a set action behavior limiting feature, a simulated behavior data sequence of the target object comprises:

based on the initial state information of each joint of the target object simulated in the physics simulator, applying forces to each joint of the target object simulated in the physics simulator according to the force data of each joint of the target object determined by the neural network model, so that the physics simulator simulates, based on the set action behavior limiting feature, one piece of simulated behavior data of the target object;

if the total number of pieces of simulated behavior data meets a set condition, confirming that a simulated behavior data sequence comprising at least one piece of simulated behavior data has been obtained;

if the total number of pieces of simulated behavior data does not meet the set condition, inputting the simulated behavior data of the target object most recently simulated by the physics simulator into the neural network model to obtain updated force data for each joint of the target object, and applying forces to each joint of the target object simulated in the physics simulator according to the updated force data, so that the physics simulator simulates, based on the set action behavior limiting feature, simulated behavior data of the target object, until the total number of pieces of simulated behavior data meets the set condition.

6. The demonstration learning method for action behavior according to claim 1, characterized in that, after the behavior control strategy expressed by the neural network model is obtained, the method further comprises:

loading the neural network model into a target application, so as to control, through the behavior control strategy expressed by the neural network model, the action behavior of the target object controlled by the target application, the target application being used to control the operation of the target object.

7. A learning device for a behavior control strategy, characterized by comprising:

a data sampling unit, configured to sample, from a demonstration behavior data sequence, a demonstration behavior data segment as a training sample, the demonstration behavior data segment comprising at least two pieces of demonstration behavior data in sequential order, and the demonstration behavior data comprising first state information of each joint of a demonstration object;

a model control unit, configured to set, according to the demonstration behavior data segment, initial state information of each joint of a target object simulated in a physics simulator, and to determine, using a neural network model to be trained, force data acting on each joint of the target object, the target object having the same joints as the demonstration object;

a data simulation unit, configured to control, based on the force data of each joint of the target object determined by the neural network model, the movement of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates, based on a set action behavior limiting feature, a simulated behavior data sequence of the target object, the simulated behavior data sequence comprising at least one piece of simulated behavior data in sequential order, the simulated behavior data comprising second state information of each joint of the target object, and the action behavior limiting feature being used to define a feature that the action behavior of the simulated target object is required to satisfy;

a difference comparison unit, configured to determine an action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulated behavior data;

a training optimization unit, configured to optimize, based on the action behavior difference degree, the behavior control strategy expressed by the neural network model until an optimization goal is reached, and to determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

8. The learning device for a behavior control strategy according to claim 7, characterized in that the training optimization unit comprises:

a loop training subunit, configured to, if the action behavior difference degree does not reach the set optimization goal, optimize the behavior control strategy expressed by the neural network model based on the action behavior difference degree, and return to execute the operation of the data sampling unit;

an end control subunit, configured to, if the action behavior difference degree reaches the set optimization goal, confirm that learning is complete and determine the behavior control strategy expressed by the neural network model as the control strategy on which demonstration learning is based.

9. A computer device, characterized by comprising:

a processor and a memory;

the processor being configured to call and execute a program stored in the memory;

controlling, based on the force data of each joint of the target object determined by the neural network model, the movement of each joint of the target object simulated in the physics simulator, so that the physics simulator simulates, based on a set action behavior limiting feature, a simulated behavior data sequence of the target object, the simulated behavior data sequence comprising at least one piece of simulated behavior data in sequential order, the simulated behavior data comprising second state information of each joint of the target object, and the action behavior limiting feature being used to define a feature that the action behavior of the simulated target object is required to satisfy; determining an action behavior difference degree between the simulated target object and the demonstration object according to the first state information of each joint of the demonstration object in the demonstration behavior data and the second state information of each joint of the target object in the simulated behavior data;

10. A storage medium, characterized in that computer-executable instructions are stored in the storage medium, and when the computer-executable instructions are loaded and executed by a processor, the learning method for a behavior control strategy according to any one of claims 1 to 6 is implemented.