CN117001673B - Training method and device for robot control model and computer equipment - Google Patents

Training method and device for robot control model and computer equipment Download PDF

Info

Publication number
CN117001673B
Authority
CN
China
Prior art keywords
sample
robot
moment
executed
transfer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311172535.2A
Other languages
Chinese (zh)
Other versions
CN117001673A (en)
Inventor
宫爱成
吕加飞
杨瑞
杨宇
李秀
孔凯贺
李东原
金鑫
徐教珅
马廷伟
王鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Nuclear Power Engineering Co Ltd
Original Assignee
China Nuclear Power Engineering Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Nuclear Power Engineering Co Ltd filed Critical China Nuclear Power Engineering Co Ltd
Priority to CN202311172535.2A priority Critical patent/CN117001673B/en
Publication of CN117001673A publication Critical patent/CN117001673A/en
Application granted granted Critical
Publication of CN117001673B publication Critical patent/CN117001673B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B25 HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00 Programme-controlled manipulators
    • B25J9/16 Programme controls
    • B25J9/1602 Programme controls characterised by the control system, structure, architecture
    • B25J9/161 Hardware, e.g. neural networks, fuzzy logic, interfaces, processor

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Manipulator (AREA)

Abstract

The application relates to a training method and device for a robot control model, and to computer equipment. The method comprises the following steps: for each sample time, processing the state of the robot and the moving target at the sample time through a strategy model to obtain the action to be executed at the sample time, obtaining the state of the robot and an initial reward value at the next time after the action to be executed is executed, and forming the transfer sample corresponding to the sample time; for the transfer sample corresponding to each sample time, obtaining the actual motion result of the robot after the action to be executed in the transfer sample is executed, and the mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample; adjusting the initial reward value according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target to the actual motion result; and training, based on the transfer samples, a robot control model that includes the strategy model. The method can improve the control effect of the robot.

Description

Training method and device for robot control model and computer equipment
Technical Field
The application relates to the technical field of multi-target reinforcement learning, in particular to a training method, a training device and computer equipment for a robot control model.
Background
In the related art, a robot is controlled to execute multi-target tasks through multi-target reinforcement learning. However, general multi-target reinforcement learning, and especially multi-target reinforcement learning with sparse rewards, makes it difficult to acquire successful experience while the robot explores a three-dimensional space.
In this case, most of the training samples in the experience replay pool are failed experiences, and performing multi-target reinforcement learning training on failed experiences leads to a poor robot control effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a training method, apparatus, and computer device for a robot control model that can improve the control effect of a robot.
In a first aspect, the present application provides a method for training a robot control model, including:
Processing, for each sample time, the state of the robot and a moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
For the transfer sample corresponding to each sample time, acquiring an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
Adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
Training a robot control model including the strategy model based on the transfer samples.
In one embodiment, obtaining the actual motion result of the robot after the action to be executed in the transfer sample is executed includes:
acquiring a historical motion trail of the robot, and determining the actual motion result of the robot after the action to be executed in the transfer sample is executed based on the historical motion trail.
In one embodiment, the initial reward value in the transfer sample is adjusted according to the degree of difference between the actual motion result and the mapping motion result; in particular, the adjusted reward value r_t' is set to 0 when the deviation between g' and φ(s_t) is smaller than ε,
where r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
In one embodiment, training the robot control model including the strategy model based on the transfer samples includes:
adjusting, based on the transfer samples, hyper-parameters in the robot control model including the strategy model;
and training the robot control model by using the adjusted hyper-parameters and the transfer samples.
In one embodiment, the hyper-parameter is a number of steps; adjusting, based on the transfer samples, the hyper-parameters in the robot control model including the strategy model includes determining the number of steps from the adjusted reward of the test stage,
where n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model.
In one embodiment, the hyper-parameter is a weight parameter; adjusting the hyper-parameters in the robot control model including the strategy model includes computing the adjusted weight parameter from the initial weight parameter and clipping it to a preset range,
where λ is the adjusted weight parameter and λ_t is the initial weight parameter.
In a second aspect, the present application also provides a training device for a robot control model, including:
The sample acquisition module is used for processing, for each sample time, the state of the robot and the moving target at the sample time through a strategy model to obtain an action to be executed at the sample time, obtaining, after the robot executes the action to be executed, the state of the robot and an initial reward value at the next time of the sample time, and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
The motion result acquisition module is used for acquiring, for the transfer sample corresponding to each sample time, an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
The sample adjustment module is used for adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
And the model training module is used for training a robot control model containing the strategy model based on the transfer sample.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of any of the methods described above when the processor executes the computer program.
In a fourth aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods described above.
In a fifth aspect, the present application also provides a computer program product. The computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the methods described above.
According to the training method and device for a robot control model and the computer equipment described above, for each sample time the state of the robot and the moving target at that time are processed by the strategy model to obtain the action to be executed; after the robot executes the action, the state of the robot and an initial reward value at the next time are obtained, and the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time and the moving target form the transfer sample corresponding to the sample time. For the transfer sample corresponding to each sample time, the actual motion result of the robot after executing the action to be executed in the transfer sample and the mapping motion result determined from the state corresponding to the sample time in the transfer sample are obtained; the initial reward value in the transfer sample is adjusted according to the degree of difference between the actual motion result and the mapping motion result, the moving target in the transfer sample is updated to the actual motion result, and a robot control model including the strategy model is trained based on the transfer samples. Compared with the poor robot control effect of the related art, updating and adjusting the initial reward value and the moving target in the transfer sample on the basis of the actual motion result and the mapping motion result ensures that the updated transfer samples represent successful experience, so the robot control model is trained more effectively and the robot control effect is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application and the technical solutions in the related art, the drawings required for describing the embodiments or the related art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without inventive effort.
Fig. 1 is a flow chart of a training method of a robot control model according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a historical motion profile provided in one embodiment;
FIG. 3 is a flow diagram of training a robot control model including a strategy model in one embodiment;
FIG. 4 is a schematic diagram of a robot control model provided in one embodiment;
FIG. 5 is a block diagram of a robot control model training apparatus according to an embodiment of the present application;
fig. 6 is an internal structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
In this embodiment, the method is applied to a computer device for illustration, and it can be understood that the method can also be applied to a server, and can also be applied to a system including the computer device and the server, and implemented through interaction between the computer device and the server.
For ease of understanding, a brief description of the robot control model is provided here. The robot control model is used to control the robot to execute multi-target tasks. The robot control model may employ DDPG (Deep Deterministic Policy Gradient, a deterministic policy gradient algorithm), which includes a strategy model and a value function model. The input of the strategy model, i.e. the Actor part, is the state of the robot and the moving target, and its output is the action to be executed. The input of the value function model, i.e. the Critic part, is the state of the robot, the action to be executed and the moving target, and its output is the future cumulative discounted reward. The future cumulative discounted reward is used to calculate the loss function of the strategy model.
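For illustration, a minimal PyTorch sketch of the two networks just described. The hidden width, layer counts and the class names StrategyModel and ValueFunctionModel are assumptions made for the sketch and are not taken from the patent.

    import torch
    import torch.nn as nn

    class StrategyModel(nn.Module):
        # Actor: maps (state, moving target) to the action to be executed.
        def __init__(self, state_dim, goal_dim, action_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, action_dim), nn.Tanh())  # actions normalised to [-1, 1]

        def forward(self, state, goal):
            return self.net(torch.cat([state, goal], dim=-1))

    class ValueFunctionModel(nn.Module):
        # Critic: maps (state, action, moving target) to the future cumulative discounted reward.
        def __init__(self, state_dim, action_dim, goal_dim, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + action_dim + goal_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, state, action, goal):
            return self.net(torch.cat([state, action, goal], dim=-1))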
Fig. 1 is a flow chart of a training method of a robot control model provided in an embodiment of the present application, where the method is applied to a computer device, and in one embodiment, as shown in fig. 1, the method includes the following steps:
S101, processing, for each sample time, the state of the robot and a moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target.
Here, the state of the robot consists of various characteristic physical parameters of the robot. The moving target is a target that the robot is set to reach, and it includes a target position. The action to be executed is the action the robot is required to perform. The initial reward value is the reward value fed back by the environment, based on a set reward function, after the robot performs the corresponding action. The strategy model is an integral part of the robot control model.
Taking a robot arm grasping an object on a desktop as an example, the state of the robot includes the geometric position of the robot arm and the pose and speed of the robot arm; the action to be executed is a grasp by the robot arm or a movement in a specific direction; and the moving target is to reach the geometric position of the object on the desktop and pick up the object.
Specifically, the transfer sample corresponding to sample time t is (s_t, a_t, r_t, s_{t+1}, g), where s_t is the state of the robot at time t, a_t is the action to be executed, r_t is the initial reward value, s_{t+1} is the state of the robot at the time next to time t, and g is the moving target.
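As a small illustration (not part of the patent text), such a transfer sample can be stored as a named tuple; the field names and the placeholder dimensions below are assumptions.

    from collections import namedtuple
    import numpy as np

    # One transfer sample (s_t, a_t, r_t, s_{t+1}, g) collected at sample time t.
    Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "goal"])

    sample = Transition(state=np.zeros(10),       # s_t, placeholder 10-dimensional state
                        action=np.zeros(4),       # a_t, placeholder 4-dimensional action
                        reward=-1.0,              # r_t, initial reward value
                        next_state=np.zeros(10),  # s_{t+1}
                        goal=np.zeros(3))         # g, moving target (e.g. a target position)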
S102, acquiring an actual motion result of the robot after executing the action to be executed in the transfer sample and a mapping motion result mapped and determined according to the corresponding state of the sample time in the transfer sample according to the transfer sample corresponding to each sample time.
The actual motion result is a result actually achieved after the robot executes the action to be executed in the transfer sample. The actual movement result includes the actual movement position. The mapping motion result is a result mapped out based on the corresponding state at the sample time in the transition sample. The mapping movement result includes mapping movement positions.
Specifically, due to wear from use and other reasons, the robot may not achieve the ideal motion result in the corresponding state, whereas the mapping motion result reflects the motion result that the robot should achieve in the current state.
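As a sketch only: the mapping from a state to the motion result that the robot should achieve in that state is often just a read-out of the achieved position from the state vector. The slice indices below are an assumption about the state layout, not something specified by the patent.

    import numpy as np

    def phi(state):
        # Map a robot state s to the motion result the robot should achieve in that state.
        # Here the first three entries of the state are assumed to hold the achieved position.
        return np.asarray(state)[:3]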
And S103, adjusting an initial rewarding value in the transfer sample according to the difference degree between the actual motion result and the mapping motion result, and updating the motion target in the transfer sample to the actual motion result.
It is worth noting that adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result allows the robot motion to fit the current usage condition of the robot.
S104, training a robot control model comprising a strategy model based on the transfer sample.
It should be appreciated that after the robot is controlled to perform the multi-target task using the robot control model, transfer samples are generated and placed in the experience replay pool by the off-policy algorithm, and transfer samples of failed experience are re-labelled, i.e. updated and adjusted, into transfer samples of successful experience using hindsight experience replay (HER). A transfer sample of failed experience consists of the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target; a transfer sample of successful experience consists of the state corresponding to the sample time, the action to be executed, the adjusted reward value, the state corresponding to the next time of the sample time, and the actual motion result.
Specifically, the re-labelled transfer sample for sample time t is (s_t, a_t, r_t', s_{t+1}, g'), where r_t' is the adjusted reward value and g' is the actual motion result.
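A minimal sketch of this re-labelling step, reusing the Transition tuple from the earlier sketch; reward_fn stands for the reward adjustment described in S103 below and is passed in rather than fixed, since the exact rule is given by the patent's own formula.

    def relabel(sample, achieved_goal, reward_fn):
        # Turn a failed-experience transition (s_t, a_t, r_t, s_{t+1}, g) into a
        # successful-experience one (s_t, a_t, r_t', s_{t+1}, g'): replace the moving
        # target with the actual motion result g' and recompute the reward.
        new_reward = reward_fn(sample.state, achieved_goal)
        return sample._replace(reward=new_reward, goal=achieved_goal)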
With the training method of the robot control model described above, for each sample time the state of the robot and the moving target at that time are processed by the strategy model to obtain the action to be executed; after the robot executes the action, the state of the robot and an initial reward value at the next time are obtained, and the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time and the moving target form the transfer sample corresponding to the sample time. For the transfer sample corresponding to each sample time, the actual motion result of the robot after executing the action to be executed in the transfer sample and the mapping motion result determined from the state corresponding to the sample time in the transfer sample are obtained; the initial reward value in the transfer sample is adjusted according to the degree of difference between the actual motion result and the mapping motion result, the moving target in the transfer sample is updated to the actual motion result, and a robot control model including the strategy model is trained based on the transfer samples. Compared with the poor robot control effect of the related art, updating and adjusting the initial reward value and the moving target in the transfer sample on the basis of the actual motion result and the mapping motion result ensures that the updated transfer samples represent successful experience, so the robot control model is trained more effectively and the robot control effect is improved.
In one embodiment, obtaining the actual motion result of the robot after the action to be executed in the transfer sample is executed includes:
acquiring a historical motion trail of the robot, and determining the actual motion result of the robot after the action to be executed in the transfer sample is executed based on the historical motion trail.
The historical motion trail is a set formed by the transfer samples at successive times. Specifically, a schematic diagram of a historical motion trail is provided, as shown in fig. 2. The "trajectory" in fig. 2 is a historical motion trail composed of the states S1, S2, S3, …, ST corresponding to the transfer samples at successive times; the "Success" dashed circle represents a successful experience, in which the "goal" marker stands for the actual motion result, and the "Fail" dashed circle represents a failed experience, in which the "goal" marker stands for the moving target.
In some embodiments, determining the actual motion result of the robot after executing the action to be executed in the transfer sample based on the historical motion trail includes: selecting the state at the next time of the sample time from the historical motion trail, and obtaining the actual motion result based on the state at the next time of the sample time.
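A sketch of this selection, assuming the historical motion trail is stored as a list of Transition tuples from the earlier sketches and that phi is the mapping described above; both assumptions are for illustration only.

    def actual_motion_result(trajectory, t, phi):
        # The actual motion result after executing the action at sample time t is read
        # from the state recorded at the next time step of the historical motion trail.
        next_state = trajectory[t].next_state
        return phi(next_state)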
In this embodiment, the actual movement result is determined through the historical movement track of the robot, so that a basis can be provided for the conversion from the failure experience, i.e., the movement target, to the success experience, i.e., the actual movement result.
In one embodiment, the initial reward value in the transfer sample is adjusted according to the degree of difference between the actual motion result and the mapping motion result; specifically, the adjusted reward value r_t' is set to 0 when the deviation between g' and φ(s_t) is smaller than ε,
where r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
Specifically, the mapping relation between the corresponding state of the sample time in the transfer sample and the mapping motion result is set by human.
In this embodiment, when the degree of difference between the actual motion result and the mapping motion result is smaller than the preset deviation threshold, the reward value is 0, so that the actual motion of the robot better fits the current usage condition of the robot.
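A sketch of one plausible form of this adjustment. The patent text states the 0 branch (reward 0 when the difference is below the threshold); the -1 value used for the other branch is an assumption following the usual sparse-reward convention in hindsight experience replay, and the default threshold is arbitrary.

    import numpy as np

    def relabel_reward(state, achieved_goal, phi, eps=0.05):
        # Adjusted reward r_t': 0 when the actual motion result g' lies within eps of the
        # mapping motion result phi(s_t); the -1 failure value is an assumed convention.
        difference = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(phi(state)))
        return 0.0 if difference < eps else -1.0

Combined with the earlier sketches, a failed transition could then be re-labelled as relabel(sample, g_actual, lambda s, g: relabel_reward(s, g, phi)), where g_actual is the actual motion result taken from the historical motion trail.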
In one embodiment, based on the transfer samples, a flow diagram for training a robot control model including a strategy model, as shown in fig. 3, includes the following:
S301, adjusting, based on the transfer sample, hyper-parameters in the robot control model including the strategy model.
Here the transfer sample consists of the state corresponding to the sample time, the action to be executed, the adjusted reward value, the state corresponding to the next time of the sample time, and the actual motion result. The hyper-parameters are parameters whose values are set before training of the robot control model starts; they include the number of steps and the weight parameter.
For ease of understanding, the reason for adjusting the hyper-parameters is briefly explained here. The value function model contained in the robot control model is expanded with a multi-step method, which raises two issues: how to set the number of expansion steps, and how to set the weight parameter that compensates for the off-policy bias introduced by the off-policy algorithm.
Specifically, at the initial stage of training, the robot control model has no prior knowledge of the robot environment and its tasks and learns through continual trial and error. This stage is not suited to the multi-step method; training should instead use hindsight experience replay directly, because failed experiences dominate at this stage and need to be converted into successful experiences. In the middle of training, if the initial weight parameter is too large it can be reduced appropriately; this stage is a transition between the multi-step method and hindsight experience replay, and although the multi-step method has begun to be applied, the weight parameter should remain small, since training should still stay closer to single-step hindsight experience replay in order to accumulate more and better experience. In the later stage of training, the weight parameter should be increased to favour the multi-step method, and a larger number of steps should be used to obtain more future information, which in turn trains the robot control model better.
It should be understood that adjusting the hyper-parameters in the robot control model including the strategy model may mean adjusting only one of the hyper-parameters or adjusting both.
S302, training the robot control model by using the adjusted hyper-parameters and the transfer samples.
In this embodiment, the robot control model is trained with transfer samples of successful experience and the adjusted hyper-parameters, which ensures the model training effect and thus improves the robot control effect.
In one embodiment, the hyper-parameter is the number of steps; adjusting, based on the transfer samples, the hyper-parameters in the robot control model including the strategy model includes determining the number of steps from the adjusted reward observed in the test stage,
where n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value (the average) of the adjusted reward value corresponding to the test stage of the robot control model.
In this embodiment, how well the current robot control model performs is judged from the adjusted reward value obtained in the test stage, and the number of steps in the robot control model is adjusted accordingly, so that the training process of the robot control model can be adjusted in real time and a better training effect is achieved.
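The concrete update rule for n is given by the formula in the original patent and is not reproduced here; the sketch below is only an assumed, illustrative schedule with the stated qualitative behaviour: as the expected adjusted reward E[r'] of the test stage improves, a larger number of steps is used. The linear mapping and the bound max_n are assumptions.

    import numpy as np

    def adjust_step_number(test_rewards, max_n=8):
        # With a sparse 0/-1 reward, the mean adjusted test reward E[r'] lies in [-1, 0];
        # map it linearly to a step count between 1 (early training) and max_n (late training).
        expected_r = float(np.mean(test_rewards))
        return max(1, int(round(1 + (max_n - 1) * (1.0 + expected_r))))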
In one embodiment, the hyper-parameter is the weight parameter; adjusting the hyper-parameters in the robot control model including the strategy model includes computing the adjusted weight parameter from the initial weight parameter and clipping it to a preset range,
where λ is the adjusted weight parameter and λ_t is the initial weight parameter.
In this embodiment, a clipping range is set to prevent the weight parameter from becoming too extreme, which yields a better training effect for the robot control model.
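Likewise, the clipping rule for the weight parameter is defined by the patent's own formula; the sketch below only illustrates the described behaviour, raising the initial weight λ_t as the expected test reward approaches 0 and clipping the result so it does not become too extreme. The growth rule and the bounds lo and hi are assumptions.

    import numpy as np

    def adjust_weight_parameter(lambda_t, expected_r, lo=0.1, hi=0.9):
        # Move the initial weight lambda_t towards hi as E[r'] approaches 0, then clip
        # the adjusted weight to [lo, hi] to keep it from becoming too extreme.
        lam = lambda_t + (1.0 + expected_r) * (hi - lambda_t)
        return float(np.clip(lam, lo, hi))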
Here, a schematic structural diagram of a robot control model is provided, as shown in fig. 4. The robot control model provided by this embodiment adopts a multi-step hindsight experience replay technique with adaptively adjusted hyper-parameters on the basis of the deterministic policy gradient algorithm.
Compared with the related art, the robot control model provided by this embodiment, by adopting multi-step hindsight experience replay with adaptively adjusted hyper-parameters on the basis of the deterministic policy gradient algorithm, significantly improves the data utilisation efficiency and the robustness of the algorithm when the multi-target reinforcement learning robot interacts with the environment, and increases the success rate with which the robot completes its task targets.
To verify the advantages of the present application, simulation environments A and B were selected for algorithm testing, with eight tasks A1, A2, A3, A4, B1, B2, B3 and B4 controlled by a common robot across the two simulation environments; one task, for example, is to operate the robot to pick up an object and place it at another location.
The number of iterations and the average task success rate are compared, and the comparison objects are algorithms C1, C2, C3 and C4. The smaller the number of iterations, the faster the convergence and the better the algorithm performance. C1 is the deterministic policy gradient algorithm, C2 is hindsight experience replay, C3 is the deterministic policy gradient algorithm combined with the multi-step method, and C4 is the algorithm of the present application, which adopts multi-step hindsight experience replay with adaptively adjusted hyper-parameters on the basis of the deterministic policy gradient algorithm. Table 1 compares the number of iterations across the algorithms, and table 2 compares the average task success rates, as follows:
Table 1 Number of iterations

Algorithm | A1 | A2
C1 | 3 | ——
C2 | 2 | 22
C4 | 1 | 15
Table 2 Average task success rate

Algorithm | A1 | A2 | A3 | A4 | B1 | B2 | B3 | B4
C1 | 1.000 | 0.072 | 0.023 | 0.046 | 0.000 | 0.000 | 0.000 | 0.002
C3 | 1.000 | 0.123 | 0.065 | 0.061 | 0.000 | 0.001 | 0.000 | 0.000
C2 | 1.000 | 0.995 | 0.547 | 0.872 | 0.463 | 0.675 | 0.244 | 0.143
C4 | 1.000 | 0.998 | 0.651 | 0.940 | 0.642 | 0.690 | 0.212 | 0.227
From tables 1 and 2, it is readily apparent that the algorithm provided by the present application is optimal in both convergence speed and final performance.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the order of execution of these steps is not strictly limited, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include several sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and which are not necessarily executed sequentially but may be performed in turn or alternately with at least part of the other steps or of the sub-steps or stages of other steps.
Based on the same inventive concept, the embodiment of the application also provides a training device for the robot control model, which is used for realizing the training method of the robot control model. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation in the embodiments of the training device for one or more robot control models provided below may be referred to the limitation of the training method for the robot control model hereinabove, and will not be repeated here.
Referring to fig. 5, fig. 5 is a block diagram of a training apparatus for a robot control model according to an embodiment of the present application, where the apparatus 500 includes: a sample acquisition module 501, a motion result acquisition module 502, a sample adjustment module 503, and a model training module 504, wherein:
The sample obtaining module 501 is configured to process, for each sample time, a state of the robot and a moving target at the sample time through a policy model, obtain a to-be-executed action at the sample time, obtain, after the robot executes the to-be-executed action, a state and an initial reward value of the robot at a next time of the sample time, and form a transition sample corresponding to the sample time from the state corresponding to the sample time, the to-be-executed action, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
The motion result obtaining module 502 is configured to obtain, for a transfer sample corresponding to each sample time, an actual motion result of the robot after performing the action to be performed in the transfer sample, and a mapped motion result mapped and determined according to a state corresponding to the sample time in the transfer sample;
The sample adjustment module 503 is configured to adjust an initial reward value in the transfer sample according to a degree of difference between the actual motion result and the mapped motion result, and update a motion target in the transfer sample to the actual motion result;
model training module 504 is configured to train a robot control model including a strategy model based on the transfer samples.
With the training device for a robot control model described above, the sample acquisition module processes, for each sample time, the state of the robot and the moving target at that time through the strategy model to obtain the action to be executed; after the robot executes the action, the state of the robot and an initial reward value at the next time are obtained, and the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time and the moving target form the transfer sample corresponding to the sample time. For the transfer sample corresponding to each sample time, the motion result acquisition module obtains the actual motion result of the robot after executing the action to be executed in the transfer sample and the mapping motion result determined from the state corresponding to the sample time in the transfer sample; the sample adjustment module adjusts the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result and updates the moving target in the transfer sample to the actual motion result; and the model training module trains a robot control model including the strategy model based on the transfer samples. Compared with the poor robot control effect of the related art, updating and adjusting the initial reward value and the moving target in the transfer sample on the basis of the actual motion result and the mapping motion result ensures that the updated transfer samples represent successful experience, so the robot control model is trained more effectively and the robot control effect is improved.
Optionally, the motion result obtaining module 502 includes:
the motion result determining unit is used for obtaining the historical motion trail of the robot and determining the actual motion result of the robot after the motion to be executed in the transfer sample is executed based on the historical motion trail.
Optionally, the sample adjustment module 503 includes:
A sample adjustment calculation unit, configured to compute the adjusted reward value,
wherein r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
Optionally, the model training module 504 includes:
The hyper-parameter adjusting unit is used for adjusting hyper-parameters in the robot control model including the strategy model based on the transfer sample;
and the model training unit is used for training the robot control model by using the adjusted hyper-parameters and the transfer samples.
Optionally, the hyper-parameter is the number of steps; the hyper-parameter adjusting unit includes:
a step number adjusting subunit, configured to determine the number of steps,
wherein n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model.
Optionally, the hyper-parameter is a weight parameter; the hyper-parameter adjusting unit includes:
a weight parameter adjusting subunit, configured to compute the adjusted weight parameter,
wherein λ is the adjusted weight parameter and λ_t is the initial weight parameter.
The respective modules in the training device of the robot control model may be implemented in whole or in part by software, hardware, and a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer equipment is used for storing data such as a corresponding state at the sample moment, an action to be executed, an initial rewarding value, a corresponding state at the next moment of the sample moment, a moving target and the like. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method of training a robot control model.
It will be appreciated by those skilled in the art that the structure shown in FIG. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the computer device to which the present inventive arrangements may be applied, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, a computer device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the training method of the robot control model provided in the above embodiment when the computer program is executed:
Processing, for each sample time, the state of the robot and the moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
For the transfer sample corresponding to each sample time, acquiring an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
Adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
Training a robot control model including the strategy model based on the transfer samples.
In one embodiment, the processor when executing the computer program further performs the steps of:
and acquiring a historical motion trail of the robot, and determining an actual motion result of the robot after the motion to be executed in the transfer sample is executed based on the historical motion trail.
In one embodiment, the processor when executing the computer program further performs the steps of:
Wherein r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
In one embodiment, the processor when executing the computer program further performs the steps of:
Based on the transfer samples, adjusting hyper-parameters in a robot control model comprising a strategy model;
and training the robot control model by using the adjusted hyper-parameters and the transfer samples.
In one embodiment, the processor when executing the computer program further performs the steps of:
The hyper-parameter is the number of steps;
wherein n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model.
In one embodiment, the processor when executing the computer program further performs the steps of:
The hyper-parameter is a weight parameter;
Wherein λ is the adjusted weight parameter and λ_t is the initial weight parameter.
The implementation principle and technical effects of the above embodiment are similar to those of the above method embodiment, and are not repeated here.
In one embodiment, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the training method of the robot control model provided in the above embodiment:
Processing, for each sample time, the state of the robot and the moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
For the transfer sample corresponding to each sample time, acquiring an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
Adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
Training a robot control model including the strategy model based on the transfer samples.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and acquiring a historical motion trail of the robot, and determining an actual motion result of the robot after the motion to be executed in the transfer sample is executed based on the historical motion trail.
In one embodiment, the computer program when executed by the processor further performs the steps of:
Wherein r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
In one embodiment, the computer program when executed by the processor further performs the steps of:
Based on the transfer samples, adjusting hyper-parameters in a robot control model comprising a strategy model;
and training the robot control model by using the adjusted hyper-parameters and the transfer samples.
In one embodiment, the computer program when executed by the processor further performs the steps of:
The hyper-parameter is the number of steps;
wherein n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
The hyper-parameter is a weight parameter;
Wherein λ is the adjusted weight parameter and λ_t is the initial weight parameter.
The implementation principle and technical effects of the above embodiment are similar to those of the above method embodiment, and are not repeated here.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the training method of the robot control model provided by the above embodiment:
Processing, for each sample time, the state of the robot and the moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
For the transfer sample corresponding to each sample time, acquiring an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
Adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
Training a robot control model including the strategy model based on the transfer samples.
In one embodiment, the computer program when executed by the processor further performs the steps of:
and acquiring a historical motion trail of the robot, and determining an actual motion result of the robot after the motion to be executed in the transfer sample is executed based on the historical motion trail.
In one embodiment, the computer program when executed by the processor further performs the steps of:
Wherein r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
In one embodiment, the computer program when executed by the processor further performs the steps of:
Based on the transfer samples, adjusting hyper-parameters in a robot control model comprising a strategy model;
and training the robot control model by using the adjusted hyper-parameters and the transfer samples.
In one embodiment, the computer program when executed by the processor further performs the steps of:
The hyper-parameter is the number of steps;
wherein n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model.
In one embodiment, the computer program when executed by the processor further performs the steps of:
The hyper-parameter is a weight parameter;
Wherein λ is the adjusted weight parameter and λ_t is the initial weight parameter.
The implementation principle and technical effects of the above embodiment are similar to those of the above method embodiment, and are not repeated here.
Those skilled in the art will appreciate that implementing all or part of the methods described above may be accomplished by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may include the flows of the embodiments of the methods described above. Any reference to memory, database or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database; non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The foregoing examples illustrate only a few embodiments of the application; they are described in some detail, but are not to be construed as limiting the scope of the application. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, and these all fall within the scope of protection of the application. Accordingly, the scope of protection of the application shall be subject to the appended claims.

Claims (10)

1. A method of training a robot control model, the method comprising:
Processing, for each sample time, the state of the robot and a moving target at the sample time through a strategy model to obtain an action to be executed at the sample time; after the robot executes the action to be executed, obtaining the state of the robot and an initial reward value at the next time of the sample time; and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
For the transfer sample corresponding to each sample time, acquiring an actual motion result of the robot after the action to be executed in the transfer sample is executed, and a mapping motion result determined by mapping from the state corresponding to the sample time in the transfer sample;
Adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
Adjusting, based on the transfer samples, hyper-parameters in a robot control model including the strategy model;
Training the robot control model by using the adjusted hyper-parameters and the transfer samples;
In the case that the hyper-parameter is a number of steps, adjusting the hyper-parameter in the robot control model including the strategy model includes determining the number of steps from the adjusted reward of the test stage,
Wherein n is the number of steps, r' is the adjusted reward value corresponding to the test stage of the robot control model, and E[r'] is the expected value of the adjusted reward value corresponding to the test stage of the robot control model;
In the case that the hyper-parameter is a weight parameter, adjusting the hyper-parameter in the robot control model including the strategy model includes computing the adjusted weight parameter from the initial weight parameter,
Wherein λ is the adjusted weight parameter, λ_t is the initial weight parameter, and n is the number of steps.
2. The method according to claim 1, wherein obtaining the actual motion result of the robot after the action to be executed in the transfer sample is executed comprises:
acquiring a historical motion trail of the robot, and determining the actual motion result of the robot after the action to be executed in the transfer sample is executed based on the historical motion trail.
3. The method of claim 2, wherein determining the actual motion result of the robot after the action to be executed in the transfer sample is executed based on the historical motion trail comprises:
selecting the state at the next time of the sample time from the historical motion trail, and obtaining the actual motion result of the robot after the action to be executed in the transfer sample is executed based on the state at the next time of the sample time.
4. The method of claim 1, wherein adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result comprises computing the adjusted reward value,
wherein r_t' is the adjusted reward value, s_t is the state corresponding to the sample time, g' is the actual motion result, φ(s_t) is the mapping motion result, and ε is the preset deviation threshold.
5. The method of claim 1, wherein the robotic control model employs a deterministic strategy gradient algorithm.
6. The method of any one of claims 1 to 5, wherein the state of the robot comprises a geometric position of a robot arm of the robot, a robot arm pose and a robot arm speed of the robot.
7. A training device for a robot control model, the device comprising:
The sample acquisition module is used for processing, for each sample time, the state of the robot and the moving target at the sample time through a strategy model to obtain an action to be executed at the sample time, obtaining, after the robot executes the action to be executed, the state of the robot and an initial reward value at the next time of the sample time, and forming a transfer sample corresponding to the sample time from the state corresponding to the sample time, the action to be executed, the initial reward value, the state corresponding to the next time of the sample time, and the moving target;
The motion result acquisition module is used for, for the transfer sample corresponding to each sample moment, acquiring the actual motion result of the robot after the action to be executed in the transfer sample is executed, and the mapping motion result determined by mapping from the state corresponding to the sample moment in the transfer sample;
The sample adjusting module is used for adjusting the initial reward value in the transfer sample according to the degree of difference between the actual motion result and the mapping motion result, and updating the moving target in the transfer sample to the actual motion result;
The model training module is used for adjusting hyper-parameters in a robot control model comprising the strategy model based on the transfer samples, and training the robot control model by utilizing the adjusted hyper-parameters and the transfer samples; in the case that the hyper-parameter is the step number, the adjusting of the hyper-parameter in the robot control model comprising the strategy model comprises:
wherein n is the step number, r' is the adjusted reward value corresponding to the testing stage of the robot control model, and E_{r'} is the expected value of the adjusted reward value corresponding to the testing stage of the robot control model; in the case that the hyper-parameter is a weight parameter, the adjusting of the hyper-parameter in the robot control model comprising the strategy model comprises:
wherein λ is the adjusted weight parameter, λ_t is the initial weight parameter, and n is the step number.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 6 when the computer program is executed.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 6.
CN202311172535.2A 2023-09-11 2023-09-11 Training method and device for robot control model and computer equipment Active CN117001673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311172535.2A CN117001673B (en) 2023-09-11 2023-09-11 Training method and device for robot control model and computer equipment

Publications (2)

Publication Number Publication Date
CN117001673A CN117001673A (en) 2023-11-07
CN117001673B true CN117001673B (en) 2024-06-04

Family

ID=88561990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311172535.2A Active CN117001673B (en) 2023-09-11 2023-09-11 Training method and device for robot control model and computer equipment

Country Status (1)

Country Link
CN (1) CN117001673B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111045443B (en) * 2018-10-11 2021-07-02 北京航空航天大学 Unmanned aerial vehicle communication network movement control method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110712201A (en) * 2019-09-20 2020-01-21 同济大学 Robot multi-joint self-adaptive compensation method based on perceptron model and stabilizer
CN112476424A (en) * 2020-11-13 2021-03-12 腾讯科技(深圳)有限公司 Robot control method, device, equipment and computer storage medium
CN114047745A (en) * 2021-10-13 2022-02-15 广州城建职业学院 Robot motion control method, robot, computer device, and storage medium

Also Published As

Publication number Publication date
CN117001673A (en) 2023-11-07

Similar Documents

Publication Publication Date Title
CN109511277B (en) Cooperative method and system for multi-state continuous action space
JP7493554B2 (en) Demonstration-Conditional Reinforcement Learning for Few-Shot Imitation
CN113419424B (en) Modeling reinforcement learning robot control method and system for reducing overestimation
WO2023087910A1 (en) Method for correcting point product error of variable resistor device array
CN115972211A (en) Control strategy offline training method based on model uncertainty and behavior prior
CN114290339B (en) Robot realistic migration method based on reinforcement learning and residual modeling
US20240169237A1 (en) A computer implemented method for real time quantum compiling based on artificial intelligence
Hu et al. Incremental learning framework for autonomous robots based on q-learning and the adaptive kernel linear model
CN117001673B (en) Training method and device for robot control model and computer equipment
CN113485107B (en) Reinforced learning robot control method and system based on consistency constraint modeling
CN116061190A (en) Method for completing cloth folding task by using course learning training mechanical arm
JP2021036395A (en) Machine learning device and machine learning method
CN114193458B (en) Robot control method based on Gaussian process online learning
CN115383744A (en) Method and device for optimizing track of mechanical arm, computer equipment and storage medium
Morales Deep Reinforcement Learning
JP7398625B2 (en) Machine learning devices, information processing methods and programs
CN112518742A (en) Multi-target robot control method based on dynamic model and post experience playback
Setyawan et al. Development Of Deep Reinforcement Learning Multi-Agent Framework Design Using Self-Organizing Map
CN118493364B (en) Tail end position control method of rope-driven flexible arm
KR20200023155A (en) A method for accelerating training process of neural network and a neural network system
JPH0561848A (en) Device and method for selecting and executing optimum algorithm
CN115293334B (en) Model-based unmanned equipment control method for high-sample-rate deep reinforcement learning
CN118642359A (en) Continuous time distributed optimization method, device, medium and product for nonlinear cluster system
CN118095340A (en) Method, equipment, medium and product for training pursuit strategy based on reinforcement learning
CN116384469B (en) Agent policy generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant