CN114047745B - Robot motion control method, robot, computer device, and storage medium - Google Patents

Robot motion control method, robot, computer device, and storage medium

Info

Publication number
CN114047745B
CN114047745B CN202111192042.6A
Authority
CN
China
Prior art keywords
value
robot
twin
optimal
reward value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111192042.6A
Other languages
Chinese (zh)
Other versions
CN114047745A (en)
Inventor
石勇
王锦红
林勇
李有兵
李奕萍
曾一新
牟海荣
彭伟
罗志高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou City Construction College
Original Assignee
Guangzhou City Construction College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou City Construction College filed Critical Guangzhou City Construction College
Priority to CN202111192042.6A priority Critical patent/CN114047745B/en
Publication of CN114047745A publication Critical patent/CN114047745A/en
Application granted granted Critical
Publication of CN114047745B publication Critical patent/CN114047745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Evolutionary Computation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot motion control method, a robot, a computer device, and a storage medium. The robot motion control method comprises performing twin learning on a reward value training set and a reward value test set to obtain an optimal reward value, solving through a reverse reinforcement learning algorithm according to the optimal reward value, an action set, a transition probability set, and the feedback quantity of the robot at a first moment, and controlling the action of the robot at a second moment according to the solving result. The method can output the optimal reward value even when prior knowledge of the working environment is insufficient, so that the reverse reinforcement learning algorithm can search for the optimal action strategy based on the optimal reward value. This improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, and realizes control of the robot such as emergency obstacle avoidance and optimal global path planning. The invention is widely applicable in the technical field of robots.

Description

Robot motion control method, robot, computer device, and storage medium
Technical Field
The invention relates to the technical field of robots, in particular to a robot motion control method based on twin reverse reinforcement learning, a robot, a computer device and a storage medium.
Background
Novel robots such as bionic robots have learning and imitation capabilities, can adapt to unstructured and complex environments, and are increasingly widely applied. However, the structure of such novel robots is complex, and so is the optimization of the related control parameters; current control algorithms have difficulty finding the optimal values of the control parameters, and control accuracy suffers when non-optimal values are used. For example, the structural-formula control algorithms currently applied to bionic robots adapt poorly to complex environments and cannot provide flexible control with accurate positioning.
Disclosure of Invention
In view of at least one of the above problems in the related art of robot control, such as insufficient adaptability to complex environments and difficulty in precise control, the present invention provides a robot motion control method based on twin reverse reinforcement learning, a robot, a computer device, and a storage medium.
In one aspect, an embodiment of the present invention includes a method for controlling robot motion based on twin reverse reinforcement learning, including:
acquiring a reward value training set and a reward value test set;
twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
acquiring an action set of a robot and a transition probability set corresponding to the action set;
acquiring feedback quantity of the robot at a first moment;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
controlling the robot to act at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
Further, the twin learning of the reward value training set and the reward value test set to obtain an optimal reward value includes:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to an output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model.
Further, the robot motion control method further includes:
pre-training the twin learning model.
Further, the pre-training the twin learning model comprises:
configuring parameter values of the twin learning model;
setting an initial value of the parameter value;
updating the parameter values in the gradient descending direction from the initial values of the parameter values;
when the parameter value is updated to correspond to the minimum second expected loss, ending the pre-training of the twin learning model; the second expected loss is determined from the output value of the twin learning model after the values in the reward value training set and the values in the reward value test set are input to the twin learning model.
Further, the solving by an inverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set, and the feedback amount includes:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
Further, the controlling the robot to move at a second time according to the solving result of the inverse reinforcement learning algorithm includes:
acquiring a control instruction corresponding to action A_1;
and sending the control instruction to the robot.
Further, the robot motion control method is executed when the robot detects an obstacle or does not reach a target point, and is stopped when the robot reaches the target point.
In another aspect, the invention also includes a robot comprising:
the sensing module is used for acquiring the feedback quantity of the robot at a first moment;
the processing module is used for acquiring a reward value training set and a reward value test set, performing twin learning on the reward value training set and the reward value test set to obtain an optimal reward value, acquiring an action set of the robot and a transition probability set corresponding to the action set, and solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
the driving module is used for controlling the action of the robot at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
In another aspect, the present invention further includes a computer device including a memory for storing at least one program and a processor for loading the at least one program to perform the twin inverse reinforcement learning based robot motion control method in an embodiment.
In another aspect, the present invention also includes a storage medium having stored therein a processor-executable program for performing the twin reverse reinforcement learning-based robot motion control method in the embodiment when the processor-executable program is executed by a processor.
The invention has the beneficial effects that: the robot motion control method based on twin reverse reinforcement learning in the embodiment combines the advantages of small-sample twin learning and reverse reinforcement learning. With the twin learning algorithm, a small-sample reward value training set and reward value test set can be acquired for training, and the optimal reward value can be output even when prior knowledge of the working environment is insufficient; the reward function for reverse reinforcement learning is thus reconstructed through small-sample twin learning, so that the reverse reinforcement learning algorithm can search for an optimal action strategy based on the optimal reward value. This improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, realizes emergency obstacle avoidance and optimal global path planning, and allows the robot to be controlled accurately.
Drawings
Fig. 1 is a flowchart of a robot motion control method based on twin reverse reinforcement learning in an embodiment;
fig. 2 is a schematic diagram of a robot motion control method based on twin reverse reinforcement learning in an embodiment.
Detailed Description
The robot motion control method based on the twin reverse reinforcement learning in this embodiment may be executed by a computer serving as an upper computer, and the robot serving as a lower computer is connected to the upper computer through a Modbus communication protocol, so that the robot and the computer may communicate with each other, for example, the robot uploads feedback quantities such as obstacles detected in a motion process to the computer, and the computer sends a control instruction to the robot.
In this embodiment, referring to fig. 1, a method for controlling robot motion based on twin reverse reinforcement learning includes the following steps:
s1, acquiring an incentive value training set and an incentive value testing set;
s2, twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
s3, acquiring an action set of the robot and a transition probability set corresponding to the action set;
s4, acquiring the feedback quantity of the robot at the first moment;
s5, solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
s6, controlling the robot to move at a second moment according to a solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
In step S1, the reward value training set is denoted R_tr = (R_tr1, R_tr2, …, R_trk) and contains a plurality of reward values R_tr1, R_tr2, … used for training; the reward value test set is denoted R_test = (R_test1, R_test2, …, R_testk) and contains a plurality of reward values R_test1, R_test2, … used for testing.
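To make the notation concrete, the sketch below shows one way such small-sample reward value sets could be laid out in code; the language, the array sizes, and the numeric values are illustrative assumptions, not data from the patent.

```python
import numpy as np

# Hypothetical small-sample reward value sets (k = 5). The numbers are
# placeholders chosen only to make the later sketches runnable.
R_tr = np.array([0.2, 0.5, 0.8, 1.0, 1.5])     # reward value training set R_tr
R_test = np.array([0.3, 0.6, 0.7, 1.1, 1.4])   # reward value test set R_test
```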
When step S2 is executed, the following steps may be specifically executed:
s201, acquiring a twin learning model;
s202, setting a target reward value; the initial value of the target reward value is set according to a value selected in the reward value training set;
s203, starting from the initial value of the target reward value, selecting a new value in a reward value training set according to the gradient descending direction to update the target reward value;
s204, taking the target reward value corresponding to the minimum first expected loss as an optimal reward value; the first expected loss is determined based on the output value of the twin learning model after the target reward value and the value in the reward value test set are input to the twin learning model.
In step S201, the pre-trained twin learning model may be acquired, and specifically, the program code may be read by a computer to run the twin learning model. If the twin learning model is not pre-trained, the pre-training step may be performed first when performing step S201.
In step S202, a target reward value is set. The target reward value is a value that can be updated; when the target reward value is initialized in step S202, a value (e.g., R_tr2) may be selected from the reward value training set R_tr = (R_tr1, R_tr2, …, R_trk) as the initial value R_0 of the target reward value.
Step S203 and step S204 may be executed as a loop. Specifically, after the initial value of the target reward value is set, the initial value R_0 = R_tr2 and a value from the reward value test set (e.g., R_test2) may be input together into the twin learning model; that is, R_tr2 and R_test2 serve as the input values of the twin learning model. The twin learning model processes them and outputs a probability value that expresses the similarity between the input values of the twin learning model (here R_tr2 and R_test2): the greater the probability value, the higher the similarity between the input values. The optimal model parameters of the twin learning model are those that minimize the expected loss.
The probability value obtained by the twin learning model for R_tr2 and R_test2 is compared with the preset expected output, and the first expected loss is calculated. When the first expected loss does not converge or does not exceed a preset first threshold, it may be judged that the first expected loss has not reached its minimum, and step S203 may be executed: starting from R_tr2, a new value is selected from the reward value training set in the direction of gradient descent to update the target reward value (e.g., the updated target reward value may be R_tr3). R_tr3 and a value from the reward value test set (e.g., R_test3) are then input together into the twin learning model; that is, R_tr3 and R_test3 serve as the input values of the twin learning model, which processes them and outputs a probability value. This probability value is compared with the preset expected output and a new first expected loss is calculated. When the new first expected loss does not converge or does not exceed the preset first threshold, it may be judged that the new first expected loss has not reached its minimum, and step S203 may be executed again to update the target reward value; when the new first expected loss converges or exceeds the preset first threshold, it may be judged that the new first expected loss has reached its minimum, and the most recently updated target reward value is taken as the optimal reward value R*.
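The loop over steps S202 to S204 can be sketched as follows. This is a minimal interpretation, not the patent's implementation: twin_model(a, b) stands in for the pre-trained twin learning model and is assumed to return a similarity probability, the first expected loss is averaged over the test set against an assumed expected output of 1, "selecting a new value in the direction of gradient descent" is approximated by greedily moving to the neighbouring training value that lowers the loss, and the stopping test uses a simple threshold/no-improvement check rather than the patent's exact convergence criterion.

```python
import numpy as np

def first_expected_loss(r_target, R_test, twin_model, expected_output=1.0):
    # Compare the twin model's similarity output for (target reward, test value)
    # pairs against the preset expected output; average over the test set.
    probs = np.array([twin_model(r_target, r_te) for r_te in R_test])
    return float(np.mean((expected_output - probs) ** 2))

def find_optimal_reward(R_tr, R_test, twin_model, start_idx=1,
                        first_threshold=1e-3, max_iters=100):
    idx = start_idx                                   # S202: initial target reward value R_0
    best_loss = first_expected_loss(R_tr[idx], R_test, twin_model)
    for _ in range(max_iters):
        if best_loss <= first_threshold:              # judged to have reached the minimum
            break
        # S203: move to whichever neighbouring training value lowers the loss
        # (a greedy stand-in for the gradient-descent direction).
        candidates = [i for i in (idx - 1, idx + 1) if 0 <= i < len(R_tr)]
        losses = {i: first_expected_loss(R_tr[i], R_test, twin_model)
                  for i in candidates}
        next_idx = min(losses, key=losses.get)
        if losses[next_idx] >= best_loss:             # no further decrease: stop
            break
        idx, best_loss = next_idx, losses[next_idx]
    return R_tr[idx]                                  # S204: optimal reward value R*
```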
In step S201, if the used twin learning model is not pre-trained, the following steps may be performed to pre-train the twin learning model:
p1, configuring parameter values of the twin learning model;
p2, setting an initial value of a parameter value;
p3, starting from the initial value of the parameter value, updating the parameter value according to the gradient descending direction;
p4, when the parameter value is updated to the corresponding minimum second expected loss, finishing the pre-training of the twin learning model; the second expected loss is determined from the output value of the twin learning model after inputting the values in the reward value training set and the values in the reward value test set into the twin learning model.
In steps P1 and P2, the parameter values θ of the twin learning model are configured, and the initial value of the parameter values of the twin learning model is set to θ_0.
Steps P3 and P4 may be executed as a loop. Specifically, one value from the reward value training set R_tr = (R_tr1, R_tr2, …, R_trk) (e.g., R_tr1) and one value from the reward value test set R_test = (R_test1, R_test2, …, R_testk) (e.g., R_test1) may be input together into the twin learning model; that is, R_tr1 and R_test1 serve as the input values of the twin learning model, which processes them and outputs a probability value. The probability value obtained by the twin learning model for R_tr1 and R_test1 is compared with the preset expected output, and the second expected loss is calculated. Specifically, if the model parameters after training on the n-th test task are denoted θ_n, the total loss function can be expressed as L(θ) = Σ_{n=1}^{N} l(θ_n), where N denotes the total number of test tasks and l(·) denotes the loss function of a single test task [formula image]; the second expected loss is calculated from this total loss.
When the second expected loss does not converge or does not exceed a preset second threshold, it may be judged that the second expected loss has not reached its minimum, and step P3 may be executed: starting from the initial value θ_0 of the parameter values of the twin learning model, the parameter values are updated in the direction of gradient descent according to the magnitude of the second expected loss. Specifically, the parameter values θ of the twin learning model may be updated by the formula θ ← θ − η∇_θ L, where L represents the second expected loss and η represents the learning rate of the twin learning model. After the parameter values θ of the twin learning model are updated, one value from the reward value training set R_tr (e.g., R_tr2) and one value from the reward value test set R_test (e.g., R_test2) may again be input together into the twin learning model; the probability value obtained by the twin learning model for R_tr2 and R_test2 is compared with the preset expected output, a second expected loss is calculated, and it is judged whether the second expected loss has reached its minimum. Specifically, when the second expected loss does not converge or does not exceed the preset second threshold, it may be judged that the second expected loss has not reached its minimum, and step P3 may be executed again; when the second expected loss converges or exceeds the preset second threshold, it may be judged that the second expected loss has reached its minimum, and step P4 may be executed: the most recently updated parameter values θ of the twin learning model are saved as the parameter values of the twin learning model used for executing steps S201 to S204.
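The pre-training loop P1 to P4 can be sketched with a deliberately tiny twin (Siamese) model so the example stays self-contained. Everything below is an assumption made for illustration: the one-parameter shared embedding, the exp(-|·|) similarity head, the squared-error form of the second expected loss, the numerical gradient, and the threshold; the patent's actual network and loss formulas are only given as images.

```python
import numpy as np

class TinyTwinModel:
    """Minimal stand-in for the twin learning model: both inputs go through
    one shared embedding, and the output is a similarity probability."""

    def __init__(self, theta0=(1.0, 0.0)):
        self.theta = np.array(theta0, dtype=float)    # P2: initial parameters theta_0

    def __call__(self, a, b):
        w, bias = self.theta
        ea, eb = w * a + bias, w * b + bias           # shared embedding
        return float(np.exp(-abs(ea - eb)))           # similarity probability in (0, 1]

def second_expected_loss(model, R_tr, R_test, expected_output=1.0):
    # Sum of single-pair losses, in the spirit of L(theta) = sum_n l(theta_n).
    return float(sum((expected_output - model(r_tr, r_te)) ** 2
                     for r_tr, r_te in zip(R_tr, R_test)))

def pretrain(model, R_tr, R_test, eta=0.01, second_threshold=1e-3,
             max_iters=500, eps=1e-5):
    """Steps P3-P4: gradient descent theta <- theta - eta * grad(L), with a
    central-difference gradient so no autodiff framework is needed."""
    for _ in range(max_iters):
        loss = second_expected_loss(model, R_tr, R_test)
        if loss <= second_threshold:                  # P4: judged to have reached the minimum
            break
        grad = np.zeros_like(model.theta)
        for i in range(len(model.theta)):
            bump = np.zeros_like(model.theta)
            bump[i] = eps
            model.theta += bump
            loss_plus = second_expected_loss(model, R_tr, R_test)
            model.theta -= 2 * bump
            loss_minus = second_expected_loss(model, R_tr, R_test)
            model.theta += bump                       # restore the parameters
            grad[i] = (loss_plus - loss_minus) / (2 * eps)
        model.theta -= eta * grad                     # P3: update along the negative gradient
    return model
```

A pre-trained instance could then be passed to the reward search sketched earlier, e.g. R_star = find_optimal_reward(R_tr, R_test, pretrain(TinyTwinModel(), R_tr, R_test)).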
In this embodiment, by using the twin learning model, the optimal reward value used for executing the reverse reinforcement learning algorithm can be obtained, so that the reverse reinforcement learning algorithm can generate a more accurate control strategy. The twin learning model has the advantages that the gradient descent optimization and the loss function optimization can be carried out on the reward function in the training set and the testing set, so that better generalization performance is obtained, and the reward value can be reconstructed by the twin learning model, so that the bionic robot can adapt to the environment more flexibly; the twin learning model can be trained under the condition of a small sample, so that the characteristics of the reward function can be effectively learned under the condition of the small sample, and the over-fitting phenomenon is not easy to occur.
In the present embodiment, after the optimal reward value R* is obtained, step S3 may be executed to acquire the action set of the robot and the transition probability set corresponding to the action set. Specifically, the action set of the robot may be represented as a set A of action matrices; the action set includes the left-leg motion, the right-leg motion, the upward arm swing, the downward arm swing, and so on of the robot, so the action set may be expressed as A = {l_left, l_right, a_left, a_right}. The robot corresponds to different state matrices in different states, and these state matrices form a state set S. By determining the transition probability matrices of the robot between different states under each action, the transition probability matrices can be combined into a transition probability set P.
The tuple (S, A, P, R) consisting of the state set S, the action set A, the transition probability set P, and the reward value R represents a Markov decision process. A Markov decision process describes a sequential decision problem, and the motion control of the bionic robot is also a sequential decision problem, so the motion control of the robot can be carried out through the Markov decision process.
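As a concrete illustration of the tuple (S, A, P, R), the sketch below builds a toy Markov decision process. Only the four action names come from the description; the three states, the transition probabilities, and the reward vector are made-up placeholders.

```python
import numpy as np

S = ["standing", "left_foot_forward", "right_foot_forward"]   # state set S (hypothetical)
A = ["l_left", "l_right", "a_left", "a_right"]                 # action set A

# Transition probability set P: one |S| x |S| matrix per action
# (row = current state, column = next state, rows sum to 1).
P = {
    "l_left":  np.array([[0.1, 0.8, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.3, 0.6, 0.1]]),
    "l_right": np.array([[0.1, 0.1, 0.8],
                         [0.3, 0.1, 0.6],
                         [0.2, 0.1, 0.7]]),
    "a_left":  np.eye(3),    # arm swings assumed not to change the gait state
    "a_right": np.eye(3),
}

R_star = np.array([0.0, 1.0, 1.0])   # optimal reward value R* per state (placeholder)
```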
In this embodiment, the time when step S4 is executed is the first time, and the feedback amount r of the robot at the first time is acquired by a sensor mounted on the robot. In order to obtain the action to be performed by the robot to be controlled at a second time after the first time, the solution may be performed by an inverse reinforcement learning algorithm.
In step S5, according to the inverse reinforcement learning algorithm, if a strategy π(s) ≡ A_1 is the optimal strategy, then it satisfies the condition given by the formula [formula image], where A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity. Thus, by checking whether this condition is satisfied, an action A_1 may be determined from the action set, and the robot can be controlled to execute action A_1 at the second moment.
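The check in step S5 can be sketched as below. Because the patent's exact inequality is only shown as an image, this sketch substitutes the classical Ng & Russell optimality condition for a constant policy, (P_A1 − P_A)(I − γP_A1)^(−1)R ⪰ 0 for every other action A, and makes two further assumptions: the feedback quantity r is folded into the reward vector as R = R* + r, and γ is an assumed discount factor.

```python
import numpy as np

def select_action(A, P, R_star, r, gamma=0.9):
    """Return an action A_1 for which the constant policy pi(s) = A_1 passes
    the inverse-RL optimality check against every other action A."""
    n = len(R_star)
    I = np.eye(n)
    R = R_star + r                                    # assumption: feedback added to the reward
    for a1 in A:
        inv = np.linalg.inv(I - gamma * P[a1])
        if all(np.all((P[a1] - P[a]) @ inv @ R >= -1e-9)
               for a in A if a != a1):
            return a1                                 # action to execute at the second moment
    return None                                       # no action satisfies the condition
```

With the toy MDP above, select_action(A, P, R_star, r=0.0) returns the action to command at the second moment, or None if no constant policy passes the check.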
In step S6, after action A_1 is determined, the computer reads the control command of the robot limb servo motors corresponding to action A_1 and sends it to the robot, so that the robot limb servo motors act according to the control command and the robot executes the corresponding action, such as the left-leg motion, the right-leg motion, the upward arm swing, or the downward arm swing.
In this embodiment, the principle of the robot motion control method based on twin reverse reinforcement learning is shown in fig. 2. Referring to fig. 2, the robot is provided with a controller for small-sample twin reverse reinforcement learning; the feedback quantity r can be continuously collected by the sensors and fed back to the computer, and the controller can receive the action control instructions sent by the computer, thereby controlling the four-limb servo motors of the robot.
Referring to fig. 2, a controller for small sample twin reverse reinforcement learning has a plurality of target points set therein, and these stepwise target points form a target path of the robot. The controller may perform steps S1-S6 before the robot movement reaches a target point, and may end performing steps S1-S6 when the robot movement reaches a target point.
Referring to fig. 2, the robot may also detect whether an obstacle is encountered during movement. If an obstacle is encountered, the controller performs steps S1 to S6; if no obstacle is encountered, the robot determines whether a target point has been reached. Before the robot reaches a target point, the controller may perform steps S1 to S6, and when the robot reaches a target point, the controller stops performing steps S1 to S6.
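Putting the pieces together, the flow of fig. 2 can be sketched as a simple control loop that reuses the find_optimal_reward and select_action sketches above. The robot object and its methods (read_feedback, detect_obstacle, at_point, send_command) are hypothetical stand-ins for the sensor, controller, and servo interfaces; they are not names from the patent.

```python
def control_loop(robot, twin_model, R_tr, R_test, A, P, target_points):
    R_star = find_optimal_reward(R_tr, R_test, twin_model)      # S1-S2: optimal reward value
    for point in target_points:                                 # staged target points of fig. 2
        while True:
            # Per fig. 2: run the steps when an obstacle is detected or the
            # target point has not yet been reached; stop once it is reached.
            if not robot.detect_obstacle() and robot.at_point(point):
                break
            r = robot.read_feedback()                            # S4: feedback at the first moment
            a1 = select_action(A, P, R_star, r)                  # S5: inverse-RL solve (S3's A, P passed in)
            robot.send_command(a1)                               # S6: act at the second moment
```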
By using the twin learning algorithm, the robot motion control method based on twin reverse reinforcement learning in this embodiment can acquire a small-sample reward value training set and reward value test set for training and output the optimal reward value even when prior knowledge of the working environment is insufficient, so that the reverse reinforcement learning algorithm can accurately control the robot based on the optimal reward value. The method combines the advantages of small-sample twin learning and reverse reinforcement learning: the reward function for reverse reinforcement learning is reconstructed through small-sample twin learning and the optimal action strategy is found, which improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, and realizes emergency obstacle avoidance and optimal global path planning.
In this embodiment, a robot may be constructed that includes a sensing module, a processing module, and a driving module. The sensing module may be composed of sensors such as an obstacle detection sensor and a distance sensor; the processing module may be a controller for small-sample twin reverse reinforcement learning, specifically a single-chip microcomputer or a dedicated robot controller; and the driving module may be the robot limb servo motors. In this embodiment, the sensing module executes step S4, the processing module executes steps S1, S2, S3, and S5, and the driving module executes step S6, so that the robot realizes the advantages of the robot motion control method based on twin reverse reinforcement learning in this embodiment.
A computer program may be written according to the robot motion control method based on twin reverse reinforcement learning in this embodiment and stored in the memory of a computer device or in an independent storage medium. When the computer program is read out, it instructs a processor to execute the robot motion control method based on twin reverse reinforcement learning in this embodiment, thereby achieving the same technical effects as the method embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of up, down, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the components of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one type of element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language (e.g., "such as") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable connection, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, or the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in this embodiment to convert the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The present invention is not limited to the above embodiments, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (8)

1. A robot motion control method based on twin reverse reinforcement learning is characterized by comprising the following steps:
acquiring a reward value training set and a reward value test set;
twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
acquiring an action set of a robot and a transition probability set corresponding to the action set;
acquiring the feedback quantity of the robot at a first moment;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
controlling the robot to act at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second moment is a moment after the first moment;
the twin learning of the reward value training set and the reward value test set to obtain the optimal reward value comprises:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to the output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity, wherein the solving comprises the following steps:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
2. The robot motion control method of claim 1, further comprising:
pre-training the twin learning model.
3. The robot motion control method of claim 2, wherein the pre-training the twin learning model comprises:
configuring parameter values of the twin learning model;
setting an initial value of the parameter value;
updating the parameter values in the gradient descending direction from the initial values of the parameter values;
when the parameter value is updated to correspond to the minimum second expected loss, ending the pre-training of the twin learning model; the second expected loss is determined according to the output value of the twin learning model after the values in the reward value training set and the values in the reward value testing set are input into the twin learning model.
4. The robot motion control method according to claim 1, wherein the controlling the robot motion at the second time based on the result of the solution of the inverse reinforcement learning algorithm includes:
acquiring a control instruction corresponding to action A_1;
and sending the control instruction to the robot.
5. The robot motion control method according to claim 1, wherein the robot motion control method is executed when the robot detects an obstacle or does not reach a target point, and is stopped when the robot reaches the target point.
6. A robot, characterized in that the robot comprises:
the sensing module is used for acquiring the feedback quantity of the robot at a first moment;
the processing module is used for acquiring a reward value training set and a reward value test set, performing twin learning on the reward value training set and the reward value test set to acquire an optimal reward value, acquiring an action set of the robot and a transition probability set corresponding to the action set, and solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
the driving module is used for controlling the action of the robot at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second moment is a moment after the first moment;
the twin learning is carried out on the reward value training set and the reward value testing set to obtain the optimal reward value, and the method comprises the following steps:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to the output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity, wherein the solving comprises the following steps:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
7. A computer device comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1-5.
8. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is adapted to perform the method of any one of claims 1-5.
CN202111192042.6A 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium Active CN114047745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111192042.6A CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111192042.6A CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN114047745A CN114047745A (en) 2022-02-15
CN114047745B true CN114047745B (en) 2023-04-07

Family

ID=80204661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111192042.6A Active CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN114047745B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116449850B (en) * 2023-06-12 2023-09-15 南京泛美利机器人科技有限公司 Three-body cooperative transportation method and system based on behavioral cloning and cooperative coefficient
CN117001673B (en) * 2023-09-11 2024-06-04 中广核工程有限公司 Training method and device for robot control model and computer equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5330138B2 (en) * 2008-11-04 2013-10-30 本田技研工業株式会社 Reinforcement learning system
CN109540151B (en) * 2018-03-25 2020-01-17 哈尔滨工程大学 AUV three-dimensional path planning method based on reinforcement learning
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tum bling Target motion planning method
CN110794832B (en) * 2019-10-21 2021-11-09 同济大学 Mobile robot path planning method based on reinforcement learning
CN111179121B (en) * 2020-01-17 2023-03-21 华南理工大学 Power grid emergency control method based on expert system and deep reverse reinforcement learning
CN111367172B (en) * 2020-02-28 2021-09-21 华南理工大学 Hybrid system energy management strategy based on reverse deep reinforcement learning
CN111638646B (en) * 2020-05-29 2024-05-28 平安科技(深圳)有限公司 Training method and device for walking controller of quadruped robot, terminal and storage medium
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN113495578B (en) * 2021-09-07 2021-12-10 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method

Also Published As

Publication number Publication date
CN114047745A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
JP7367233B2 (en) System and method for robust optimization of reinforcement learning based on trajectory-centered models
CN114047745B (en) Robot motion control method, robot, computer device, and storage medium
US11235461B2 (en) Controller and machine learning device
TWI743986B (en) Motor control device and motor control method
CN112045675B (en) Robot equipment controller, robot equipment system and control method thereof
JP7301034B2 (en) System and Method for Policy Optimization Using Quasi-Newton Trust Region Method
US11897066B2 (en) Simulation apparatus
CN112135717A (en) System and method for pixel-based model predictive control
Dani et al. Human-in-the-loop robot control for human-robot collaboration: Human intention estimation and safe trajectory tracking control for collaborative tasks
JP2022543926A (en) System and Design of Derivative-Free Model Learning for Robotic Systems
US11402808B2 (en) Configuring a system which interacts with an environment
US20190317472A1 (en) Controller and control method
JP2019185742A (en) Controller and control method
Caarls et al. Parallel online temporal difference learning for motor control
JP7180696B2 (en) Control device, control method and program
Oguz et al. Progressive stochastic motion planning for human-robot interaction
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
CN114193443A (en) Apparatus and method for controlling robot apparatus
US11703871B2 (en) Method of controlling a vehicle and apparatus for controlling a vehicle
US20230241770A1 (en) Control device, control method and storage medium
Rottmann et al. Adaptive autonomous control using online value iteration with gaussian processes
JP6438354B2 (en) Self-position estimation apparatus and mobile body equipped with self-position estimation apparatus
CN110962120B (en) Network model training method and device, and mechanical arm motion control method and device
Le et al. Model-based Q-learning for humanoid robots
CN104345637B (en) Method and apparatus for adapting a data-based function model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant