CN114047745B - Robot motion control method, robot, computer device, and storage medium - Google Patents

Robot motion control method, robot, computer device, and storage medium

Info

Publication number
CN114047745B
CN114047745B CN202111192042.6A
Authority
CN
China
Prior art keywords
value
robot
twin
optimal
reward value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111192042.6A
Other languages
Chinese (zh)
Other versions
CN114047745A (en)
Inventor
石勇
王锦红
林勇
李有兵
李奕萍
曾一新
牟海荣
彭伟
罗志高
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou City Construction College
Original Assignee
Guangzhou City Construction College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou City Construction College filed Critical Guangzhou City Construction College
Priority to CN202111192042.6A priority Critical patent/CN114047745B/en
Publication of CN114047745A publication Critical patent/CN114047745A/en
Application granted granted Critical
Publication of CN114047745B publication Critical patent/CN114047745B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0219Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory ensuring the processing of the whole working surface
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Evolutionary Computation (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Manipulator (AREA)

Abstract

The invention discloses a robot motion control method, a robot, a computer device, and a storage medium. The robot motion control method comprises performing twin learning on a reward value training set and a reward value test set to obtain an optimal reward value, solving through a reverse reinforcement learning algorithm according to the optimal reward value, an action set, a transition probability set, and the feedback quantity of the robot at a first moment, and controlling the action of the robot at a second moment according to the solving result. The method can output the optimal reward value even when prior knowledge of the working environment is insufficient, so that the reverse reinforcement learning algorithm can search for the optimal action strategy based on the optimal reward value. This improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, and realizes control of the robot such as emergency obstacle avoidance and optimal global path planning. The invention is widely applicable in the technical field of robots.

Description

Robot motion control method, robot, computer device, and storage medium
Technical Field
The invention relates to the technical field of robots, in particular to a robot motion control method based on twin reverse reinforcement learning, a robot, a computer device and a storage medium.
Background
Novel robots such as bionic robots have learning and imitation capabilities, can adapt to unstructured and complex environments, and are increasingly widely applied. However, the structure of such novel robots is complex, and so is the optimization of the related control parameters; current control algorithms have difficulty finding the optimal values of the control parameters, and control accuracy suffers when non-optimal values are used. For example, the structural-formula control algorithms currently applied to bionic robots adapt poorly to complex environments and cannot provide flexible control with accurate positioning.
Disclosure of Invention
In view of at least one of the above problems in the related art of robot control, such as insufficient adaptability to complex environments and difficulty in precise control, the present invention provides a robot motion control method based on twin reverse reinforcement learning, a robot, a computer device, and a storage medium.
In one aspect, an embodiment of the present invention includes a method for controlling robot motion based on twin reverse reinforcement learning, including:
acquiring a reward value training set and a reward value test set;
twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
acquiring an action set of a robot and a transition probability set corresponding to the action set;
acquiring feedback quantity of the robot at a first moment;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
controlling the robot to act at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
Further, the twin learning of the reward value training set and the reward value test set to obtain an optimal reward value includes:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to an output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model.
Further, the robot motion control method further includes:
pre-training the twin learning model.
Further, the pre-training the twin learning model comprises:
configuring parameter values of the twin learning model;
setting an initial value of the parameter value;
updating the parameter values in the gradient descending direction from the initial values of the parameter values;
when the parameter value is updated to correspond to the minimum second expected loss, ending the pre-training of the twin learning model; the second expected loss is determined from the output value of the twin learning model after the values in the reward value training set and the values in the reward value test set are input to the twin learning model.
Further, the solving by an inverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set, and the feedback amount includes:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
Further, the controlling the robot to move at a second time according to the solving result of the inverse reinforcement learning algorithm includes:
acquiring a control instruction corresponding to action A_1;
and sending the control instruction to the robot.
Further, the robot motion control method is executed when the robot detects an obstacle or does not reach a target point, and is stopped when the robot reaches the target point.
In another aspect, the invention also includes a robot comprising:
the sensing module is used for acquiring the feedback quantity of the robot at a first moment;
the processing module is used for acquiring a reward value training set and a reward value test set, performing twin learning on the reward value training set and the reward value test set to obtain an optimal reward value, acquiring an action set of the robot and a transition probability set corresponding to the action set, and solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
the driving module is used for controlling the action of the robot at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
In another aspect, the present invention further includes a computer device including a memory for storing at least one program and a processor for loading the at least one program to perform the twin inverse reinforcement learning based robot motion control method in an embodiment.
In another aspect, the present invention also includes a storage medium having stored therein a processor-executable program for performing the twin reverse reinforcement learning-based robot motion control method in the embodiment when the processor-executable program is executed by a processor.
The invention has the beneficial effects that: the robot motion control method based on twin reverse reinforcement learning in the embodiment combines the advantages of small-sample twin learning and reverse reinforcement learning. With the twin learning algorithm, a small-sample reward value training set and reward value test set can be acquired for training, and the optimal reward value can be output even when prior knowledge of the working environment is insufficient; the reward function for reverse reinforcement learning is thus reconstructed through small-sample twin learning, so that the reverse reinforcement learning algorithm can search for an optimal action strategy based on the optimal reward value. This improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, realizes emergency obstacle avoidance and optimal global path planning, and allows the robot to be controlled accurately.
Drawings
Fig. 1 is a flowchart of a robot motion control method based on twin reverse reinforcement learning in an embodiment;
fig. 2 is a schematic diagram of a robot motion control method based on twin reverse reinforcement learning in an embodiment.
Detailed Description
The robot motion control method based on the twin reverse reinforcement learning in this embodiment may be executed by a computer serving as an upper computer, and the robot serving as a lower computer is connected to the upper computer through a Modbus communication protocol, so that the robot and the computer may communicate with each other, for example, the robot uploads feedback quantities such as obstacles detected in a motion process to the computer, and the computer sends a control instruction to the robot.
In this embodiment, referring to fig. 1, a method for controlling robot motion based on twin reverse reinforcement learning includes the following steps:
s1, acquiring an incentive value training set and an incentive value testing set;
s2, twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
s3, acquiring an action set of the robot and a transition probability set corresponding to the action set;
s4, acquiring the feedback quantity of the robot at the first moment;
s5, solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
s6, controlling the robot to move at a second moment according to a solving result of the reverse reinforcement learning algorithm; the second time is a time after the first time.
In step S1, the reward value training set is denoted R_tr = (R_tr1, R_tr2, …, R_trk) and contains a plurality of reward values R_tr1, R_tr2, … used for training; the reward value test set is denoted R_test = (R_test1, R_test2, …, R_testk) and contains a plurality of reward values R_test1, R_test2, … used for testing.
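To make the notation concrete, the sketch below shows one way such small-sample reward value sets could be laid out in code; the language, the array sizes, and the numeric values are illustrative assumptions, not data from the patent.

```python
import numpy as np

# Hypothetical small-sample reward value sets (k = 5). The numbers are
# placeholders chosen only to make the later sketches runnable.
R_tr = np.array([0.2, 0.5, 0.8, 1.0, 1.5])     # reward value training set R_tr
R_test = np.array([0.3, 0.6, 0.7, 1.1, 1.4])   # reward value test set R_test
```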
When step S2 is executed, the following steps may be specifically executed:
s201, acquiring a twin learning model;
s202, setting a target reward value; the initial value of the target reward value is set according to a value selected in the reward value training set;
s203, starting from the initial value of the target reward value, selecting a new value in a reward value training set according to the gradient descending direction to update the target reward value;
s204, taking the target reward value corresponding to the minimum first expected loss as an optimal reward value; the first expected loss is determined based on the output value of the twin learning model after the target reward value and the value in the reward value test set are input to the twin learning model.
In step S201, the pre-trained twin learning model may be acquired, and specifically, the program code may be read by a computer to run the twin learning model. If the twin learning model is not pre-trained, the pre-training step may be performed first when performing step S201.
In step S202, a target reward value is set. The target reward value is a value that can be updated; when the target reward value is initialized in step S202, a value (e.g., R_tr2) may be selected from the reward value training set R_tr = (R_tr1, R_tr2, …, R_trk) as the initial value R_0 of the target reward value.
Step S203 and step S204 may be executed as a loop. Specifically, after the initial value of the target reward value is set, the initial value R_0 = R_tr2 and a value from the reward value test set (e.g., R_test2) may be input together into the twin learning model; that is, R_tr2 and R_test2 serve as the input values of the twin learning model. The twin learning model processes them and outputs a probability value that expresses the similarity between the input values of the twin learning model (here R_tr2 and R_test2): the greater the probability value, the higher the similarity between the input values. The optimal model parameters of the twin learning model are those that minimize the expected loss.
The probability value obtained by the twin learning model for R_tr2 and R_test2 is compared with the preset expected output, and the first expected loss is calculated. When the first expected loss does not converge or does not exceed a preset first threshold, it may be judged that the first expected loss has not reached its minimum, and step S203 may be executed: starting from R_tr2, a new value is selected from the reward value training set in the direction of gradient descent to update the target reward value (e.g., the updated target reward value may be R_tr3). R_tr3 and a value from the reward value test set (e.g., R_test3) are then input together into the twin learning model; that is, R_tr3 and R_test3 serve as the input values of the twin learning model, which processes them and outputs a probability value. This probability value is compared with the preset expected output and a new first expected loss is calculated. When the new first expected loss does not converge or does not exceed the preset first threshold, it may be judged that the new first expected loss has not reached its minimum, and step S203 may be executed again to update the target reward value; when the new first expected loss converges or exceeds the preset first threshold, it may be judged that the new first expected loss has reached its minimum, and the most recently updated target reward value is taken as the optimal reward value R*.
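The loop over steps S202 to S204 can be sketched as follows. This is a minimal interpretation, not the patent's implementation: twin_model(a, b) stands in for the pre-trained twin learning model and is assumed to return a similarity probability, the first expected loss is averaged over the test set against an assumed expected output of 1, "selecting a new value in the direction of gradient descent" is approximated by greedily moving to the neighbouring training value that lowers the loss, and the stopping test uses a simple threshold/no-improvement check rather than the patent's exact convergence criterion.

```python
import numpy as np

def first_expected_loss(r_target, R_test, twin_model, expected_output=1.0):
    # Compare the twin model's similarity output for (target reward, test value)
    # pairs against the preset expected output; average over the test set.
    probs = np.array([twin_model(r_target, r_te) for r_te in R_test])
    return float(np.mean((expected_output - probs) ** 2))

def find_optimal_reward(R_tr, R_test, twin_model, start_idx=1,
                        first_threshold=1e-3, max_iters=100):
    idx = start_idx                                   # S202: initial target reward value R_0
    best_loss = first_expected_loss(R_tr[idx], R_test, twin_model)
    for _ in range(max_iters):
        if best_loss <= first_threshold:              # judged to have reached the minimum
            break
        # S203: move to whichever neighbouring training value lowers the loss
        # (a greedy stand-in for the gradient-descent direction).
        candidates = [i for i in (idx - 1, idx + 1) if 0 <= i < len(R_tr)]
        losses = {i: first_expected_loss(R_tr[i], R_test, twin_model)
                  for i in candidates}
        next_idx = min(losses, key=losses.get)
        if losses[next_idx] >= best_loss:             # no further decrease: stop
            break
        idx, best_loss = next_idx, losses[next_idx]
    return R_tr[idx]                                  # S204: optimal reward value R*
```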
In step S201, if the used twin learning model is not pre-trained, the following steps may be performed to pre-train the twin learning model:
p1, configuring parameter values of the twin learning model;
p2, setting an initial value of a parameter value;
p3, starting from the initial value of the parameter value, updating the parameter value according to the gradient descending direction;
p4, when the parameter value is updated to the corresponding minimum second expected loss, finishing the pre-training of the twin learning model; the second expected loss is determined from the output value of the twin learning model after inputting the values in the reward value training set and the values in the reward value test set into the twin learning model.
In steps P1 and P2, the parameter values θ of the twin learning model are configured, and the initial value of the parameter values of the twin learning model is set to θ_0.
Steps P3 and P4 may be executed as a loop. Specifically, one value from the reward value training set R_tr = (R_tr1, R_tr2, …, R_trk) (e.g., R_tr1) and one value from the reward value test set R_test = (R_test1, R_test2, …, R_testk) (e.g., R_test1) may be input together into the twin learning model; that is, R_tr1 and R_test1 serve as the input values of the twin learning model, which processes them and outputs a probability value. The probability value obtained by the twin learning model for R_tr1 and R_test1 is compared with the preset expected output, and the second expected loss is calculated. Specifically, if the model parameters after training on the n-th test task are denoted θ_n, the total loss function can be expressed as L(θ) = Σ_{n=1}^{N} l(θ_n), where N denotes the total number of test tasks and l(·) denotes the loss function of a single test task [formula image]; the second expected loss is calculated from this total loss.
When the second expected loss does not converge or does not exceed a preset second threshold, it may be judged that the second expected loss has not reached its minimum, and step P3 may be executed: starting from the initial value θ_0 of the parameter values of the twin learning model, the parameter values are updated in the direction of gradient descent according to the magnitude of the second expected loss. Specifically, the parameter values θ of the twin learning model may be updated by the formula θ ← θ − η∇_θ L, where L represents the second expected loss and η represents the learning rate of the twin learning model. After the parameter values θ of the twin learning model are updated, one value from the reward value training set R_tr (e.g., R_tr2) and one value from the reward value test set R_test (e.g., R_test2) may again be input together into the twin learning model; the probability value obtained by the twin learning model for R_tr2 and R_test2 is compared with the preset expected output, a second expected loss is calculated, and it is judged whether the second expected loss has reached its minimum. Specifically, when the second expected loss does not converge or does not exceed the preset second threshold, it may be judged that the second expected loss has not reached its minimum, and step P3 may be executed again; when the second expected loss converges or exceeds the preset second threshold, it may be judged that the second expected loss has reached its minimum, and step P4 may be executed: the most recently updated parameter values θ of the twin learning model are saved as the parameter values of the twin learning model used for executing steps S201 to S204.
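The pre-training loop P1 to P4 can be sketched with a deliberately tiny twin (Siamese) model so the example stays self-contained. Everything below is an assumption made for illustration: the one-parameter shared embedding, the exp(-|·|) similarity head, the squared-error form of the second expected loss, the numerical gradient, and the threshold; the patent's actual network and loss formulas are only given as images.

```python
import numpy as np

class TinyTwinModel:
    """Minimal stand-in for the twin learning model: both inputs go through
    one shared embedding, and the output is a similarity probability."""

    def __init__(self, theta0=(1.0, 0.0)):
        self.theta = np.array(theta0, dtype=float)    # P2: initial parameters theta_0

    def __call__(self, a, b):
        w, bias = self.theta
        ea, eb = w * a + bias, w * b + bias           # shared embedding
        return float(np.exp(-abs(ea - eb)))           # similarity probability in (0, 1]

def second_expected_loss(model, R_tr, R_test, expected_output=1.0):
    # Sum of single-pair losses, in the spirit of L(theta) = sum_n l(theta_n).
    return float(sum((expected_output - model(r_tr, r_te)) ** 2
                     for r_tr, r_te in zip(R_tr, R_test)))

def pretrain(model, R_tr, R_test, eta=0.01, second_threshold=1e-3,
             max_iters=500, eps=1e-5):
    """Steps P3-P4: gradient descent theta <- theta - eta * grad(L), with a
    central-difference gradient so no autodiff framework is needed."""
    for _ in range(max_iters):
        loss = second_expected_loss(model, R_tr, R_test)
        if loss <= second_threshold:                  # P4: judged to have reached the minimum
            break
        grad = np.zeros_like(model.theta)
        for i in range(len(model.theta)):
            bump = np.zeros_like(model.theta)
            bump[i] = eps
            model.theta += bump
            loss_plus = second_expected_loss(model, R_tr, R_test)
            model.theta -= 2 * bump
            loss_minus = second_expected_loss(model, R_tr, R_test)
            model.theta += bump                       # restore the parameters
            grad[i] = (loss_plus - loss_minus) / (2 * eps)
        model.theta -= eta * grad                     # P3: update along the negative gradient
    return model
```

A pre-trained instance could then be passed to the reward search sketched earlier, e.g. R_star = find_optimal_reward(R_tr, R_test, pretrain(TinyTwinModel(), R_tr, R_test)).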
In this embodiment, by using the twin learning model, the optimal reward value used for executing the reverse reinforcement learning algorithm can be obtained, so that the reverse reinforcement learning algorithm can generate a more accurate control strategy. The twin learning model has the advantages that the gradient descent optimization and the loss function optimization can be carried out on the reward function in the training set and the testing set, so that better generalization performance is obtained, and the reward value can be reconstructed by the twin learning model, so that the bionic robot can adapt to the environment more flexibly; the twin learning model can be trained under the condition of a small sample, so that the characteristics of the reward function can be effectively learned under the condition of the small sample, and the over-fitting phenomenon is not easy to occur.
In the present embodiment, after the optimal reward value R* is obtained, step S3 may be executed to acquire the action set of the robot and the transition probability set corresponding to the action set. Specifically, the action set of the robot may be represented as a set A of action matrices; the action set includes the left-leg motion, the right-leg motion, the upward arm swing, the downward arm swing, and so on of the robot, so the action set may be expressed as A = {l_left, l_right, a_left, a_right}. The robot corresponds to different state matrices in different states, and these state matrices form a state set S. By determining the transition probability matrices of the robot between different states under each action, the transition probability matrices can be combined into a transition probability set P.
The tuple (S, A, P, R) consisting of the state set S, the action set A, the transition probability set P, and the reward value R represents a Markov decision process. A Markov decision process describes a sequential decision problem, and the motion control of the bionic robot is also a sequential decision problem, so the motion control of the robot can be carried out through the Markov decision process.
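As a concrete illustration of the tuple (S, A, P, R), the sketch below builds a toy Markov decision process. Only the four action names come from the description; the three states, the transition probabilities, and the reward vector are made-up placeholders.

```python
import numpy as np

S = ["standing", "left_foot_forward", "right_foot_forward"]   # state set S (hypothetical)
A = ["l_left", "l_right", "a_left", "a_right"]                 # action set A

# Transition probability set P: one |S| x |S| matrix per action
# (row = current state, column = next state, rows sum to 1).
P = {
    "l_left":  np.array([[0.1, 0.8, 0.1],
                         [0.2, 0.7, 0.1],
                         [0.3, 0.6, 0.1]]),
    "l_right": np.array([[0.1, 0.1, 0.8],
                         [0.3, 0.1, 0.6],
                         [0.2, 0.1, 0.7]]),
    "a_left":  np.eye(3),    # arm swings assumed not to change the gait state
    "a_right": np.eye(3),
}

R_star = np.array([0.0, 1.0, 1.0])   # optimal reward value R* per state (placeholder)
```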
In this embodiment, the time when step S4 is executed is the first time, and the feedback amount r of the robot at the first time is acquired by a sensor mounted on the robot. In order to obtain the action to be performed by the robot to be controlled at a second time after the first time, the solution may be performed by an inverse reinforcement learning algorithm.
In step S5, according to the inverse reinforcement learning algorithm, if a strategy π(s) ≡ A_1 is the optimal strategy, then it satisfies the condition given by the formula [formula image], where A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity. Thus, by checking whether this condition is satisfied, an action A_1 may be determined from the action set, and the robot can be controlled to execute action A_1 at the second moment.
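The check in step S5 can be sketched as below. Because the patent's exact inequality is only shown as an image, this sketch substitutes the classical Ng & Russell optimality condition for a constant policy, (P_A1 − P_A)(I − γP_A1)^(−1)R ⪰ 0 for every other action A, and makes two further assumptions: the feedback quantity r is folded into the reward vector as R = R* + r, and γ is an assumed discount factor.

```python
import numpy as np

def select_action(A, P, R_star, r, gamma=0.9):
    """Return an action A_1 for which the constant policy pi(s) = A_1 passes
    the inverse-RL optimality check against every other action A."""
    n = len(R_star)
    I = np.eye(n)
    R = R_star + r                                    # assumption: feedback added to the reward
    for a1 in A:
        inv = np.linalg.inv(I - gamma * P[a1])
        if all(np.all((P[a1] - P[a]) @ inv @ R >= -1e-9)
               for a in A if a != a1):
            return a1                                 # action to execute at the second moment
    return None                                       # no action satisfies the condition
```

With the toy MDP above, select_action(A, P, R_star, r=0.0) returns the action to command at the second moment, or None if no constant policy passes the check.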
In step S6, after action A_1 is determined, the computer reads the control command of the robot limb servo motors corresponding to action A_1 and sends it to the robot, so that the robot limb servo motors act according to the control command and the robot executes the corresponding action, such as the left-leg motion, the right-leg motion, the upward arm swing, or the downward arm swing.
In this embodiment, the principle of the robot motion control method based on twin reverse reinforcement learning is shown in fig. 2. Referring to fig. 2, the robot is provided with a controller for small-sample twin reverse reinforcement learning; the feedback quantity r can be continuously collected by the sensors and fed back to the computer, and the controller can receive the action control instructions sent by the computer, thereby controlling the four-limb servo motors of the robot.
Referring to fig. 2, a controller for small sample twin reverse reinforcement learning has a plurality of target points set therein, and these stepwise target points form a target path of the robot. The controller may perform steps S1-S6 before the robot movement reaches a target point, and may end performing steps S1-S6 when the robot movement reaches a target point.
Referring to fig. 2, the robot may also detect whether an obstacle is encountered during movement. If an obstacle is encountered, the controller performs steps S1 to S6; if no obstacle is encountered, the robot determines whether a target point has been reached. Before the robot reaches a target point, the controller may perform steps S1 to S6, and when the robot reaches a target point, the controller stops performing steps S1 to S6.
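Putting the pieces together, the flow of fig. 2 can be sketched as a simple control loop that reuses the find_optimal_reward and select_action sketches above. The robot object and its methods (read_feedback, detect_obstacle, at_point, send_command) are hypothetical stand-ins for the sensor, controller, and servo interfaces; they are not names from the patent.

```python
def control_loop(robot, twin_model, R_tr, R_test, A, P, target_points):
    R_star = find_optimal_reward(R_tr, R_test, twin_model)      # S1-S2: optimal reward value
    for point in target_points:                                 # staged target points of fig. 2
        while True:
            # Per fig. 2: run the steps when an obstacle is detected or the
            # target point has not yet been reached; stop once it is reached.
            if not robot.detect_obstacle() and robot.at_point(point):
                break
            r = robot.read_feedback()                            # S4: feedback at the first moment
            a1 = select_action(A, P, R_star, r)                  # S5: inverse-RL solve (S3's A, P passed in)
            robot.send_command(a1)                               # S6: act at the second moment
```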
By using the twin learning algorithm, the robot motion control method based on twin reverse reinforcement learning in this embodiment can acquire a small-sample reward value training set and reward value test set for training and output the optimal reward value even when prior knowledge of the working environment is insufficient, so that the reverse reinforcement learning algorithm can accurately control the robot based on the optimal reward value. The method combines the advantages of small-sample twin learning and reverse reinforcement learning: the reward function for reverse reinforcement learning is reconstructed through small-sample twin learning and the optimal action strategy is found, which improves the speed of finding the optimal action strategy, enables the robot to adapt quickly to complex environments despite insufficient prior knowledge, improves control precision and flexibility, and realizes emergency obstacle avoidance and optimal global path planning.
In this embodiment, a robot may be constructed that includes a sensing module, a processing module, and a driving module. The sensing module may be composed of sensors such as an obstacle detection sensor and a distance sensor; the processing module may be a controller for small-sample twin reverse reinforcement learning, specifically a single-chip microcomputer or a dedicated robot controller; and the driving module may be the robot limb servo motors. In this embodiment, the sensing module executes step S4, the processing module executes steps S1, S2, S3, and S5, and the driving module executes step S6, so that the robot realizes the advantages of the robot motion control method based on twin reverse reinforcement learning in this embodiment.
A computer program may be written according to the robot motion control method based on twin reverse reinforcement learning in this embodiment and stored in the memory of a computer device or in an independent storage medium. When the computer program is read out, it instructs a processor to execute the robot motion control method based on twin reverse reinforcement learning in this embodiment, thereby achieving the same technical effects as the method embodiment.
It should be noted that, unless otherwise specified, when a feature is referred to as being "fixed" or "connected" to another feature, it may be directly fixed or connected to the other feature or indirectly fixed or connected to the other feature. Furthermore, the descriptions of up, down, left, right, etc. used in the present disclosure are only relative to the mutual positional relationship of the components of the present disclosure in the drawings. As used in this disclosure, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. In addition, unless defined otherwise, all technical and scientific terms used in this example have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in the description of the embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this embodiment, the term "and/or" includes any combination of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one type of element from another. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of the present disclosure. The use of any and all examples, or exemplary language (e.g., "such as") provided with this embodiment is intended merely to better illuminate embodiments of the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
It should be recognized that embodiments of the present invention can be realized and implemented by computer hardware, a combination of hardware and software, or by computer instructions stored in a non-transitory computer readable memory. The methods may be implemented in a computer program using standard programming techniques, including a non-transitory computer-readable storage medium configured with the computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the detailed description. Each program may be implemented in a high level procedural or object oriented programming language to communicate with a computer system. However, the program(s) can be implemented in assembly or machine language, if desired. In any case, the language may be a compiled or interpreted language. Furthermore, the program can be run on a programmed application specific integrated circuit for this purpose.
Further, operations of processes described in this embodiment can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The processes described in this embodiment (or variations and/or combinations thereof) may be performed under the control of one or more computer systems configured with executable instructions, and may be implemented as code (e.g., executable instructions, one or more computer programs, or one or more applications) collectively executed on one or more processors, by hardware, or combinations thereof. The computer program includes a plurality of instructions executable by one or more processors.
Further, the method may be implemented in any type of computing platform operatively connected to a suitable connection, including but not limited to a personal computer, mini computer, mainframe, workstation, networked or distributed computing environment, separate or integrated computer platform, or in communication with a charged particle tool or other imaging device, or the like. Aspects of the invention may be embodied in machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, optically read and/or write storage medium, RAM, ROM, or the like, such that it may be read by a programmable computer, which when read by the storage medium or device, is operative to configure and operate the computer to perform the procedures described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The invention described in this embodiment includes these and other different types of non-transitory computer-readable storage media when such media include instructions or programs that implement the steps described above in conjunction with a microprocessor or other data processor. The invention also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described in this embodiment to convert the input data to generate output data that is stored to non-volatile memory. The output information may also be applied to one or more output devices, such as a display. In a preferred embodiment of the invention, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on a display.
The present invention is not limited to the above embodiments, and any modifications, equivalent substitutions, improvements, etc. within the spirit and principle of the present invention should be included in the protection scope of the present invention as long as the technical effects of the present invention are achieved by the same means. The invention is capable of other modifications and variations in its technical solution and/or its implementation, within the scope of protection of the invention.

Claims (8)

1. A robot motion control method based on twin reverse reinforcement learning is characterized by comprising the following steps:
acquiring a reward value training set and a reward value test set;
twin learning is carried out on the reward value training set and the reward value testing set, and an optimal reward value is obtained;
acquiring an action set of a robot and a transition probability set corresponding to the action set;
acquiring the feedback quantity of the robot at a first moment;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
controlling the robot to act at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second moment is a moment after the first moment;
the twin learning of the reward value training set and the reward value test set to obtain the optimal reward value comprises:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to the output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity, wherein the solving comprises the following steps:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
2. The robot motion control method of claim 1, further comprising:
pre-training the twin learning model.
3. The robot motion control method of claim 2, wherein the pre-training the twin learning model comprises:
configuring parameter values of the twin learning model;
setting an initial value of the parameter value;
updating the parameter values in the gradient descending direction from the initial values of the parameter values;
when the parameter value is updated to correspond to the minimum second expected loss, ending the pre-training of the twin learning model; the second expected loss is determined according to the output value of the twin learning model after the values in the reward value training set and the values in the reward value testing set are input into the twin learning model.
4. The robot motion control method according to claim 1, wherein the controlling the robot motion at the second time based on the result of the solution of the inverse reinforcement learning algorithm includes:
acquiring a control instruction corresponding to action A_1;
and sending the control instruction to the robot.
5. The robot motion control method according to claim 1, wherein the robot motion control method is executed when the robot detects an obstacle or does not reach a target point, and is stopped when the robot reaches the target point.
6. A robot, characterized in that the robot comprises:
the sensing module is used for acquiring the feedback quantity of the robot at a first moment;
the processing module is used for acquiring a reward value training set and a reward value test set, performing twin learning on the reward value training set and the reward value test set to acquire an optimal reward value, acquiring an action set of the robot and a transition probability set corresponding to the action set, and solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity;
the driving module is used for controlling the action of the robot at a second moment according to the solving result of the reverse reinforcement learning algorithm; the second moment is a moment after the first moment;
the twin learning is carried out on the reward value training set and the reward value testing set to obtain the optimal reward value, and the method comprises the following steps:
acquiring a twin learning model;
setting a target reward value; the initial value of the target reward value is set according to a selected value in the reward value training set;
starting from the initial value of the target reward value, selecting a new value in the reward value training set according to the gradient descending direction to update the target reward value;
taking the target reward value corresponding to the smallest first expected loss as the optimal reward value; the first expected loss is determined according to the output value of the twin learning model after the target reward value and the value in the reward value test set are input into the twin learning model;
solving through a reverse reinforcement learning algorithm according to the optimal reward value, the action set, the transition probability set and the feedback quantity, wherein the solving comprises the following steps:
determining A_1 by the formula [formula image]; in the formula, A_1 represents an action in the action set, A represents any action in the action set other than A_1, P_A represents the transition probability matrix corresponding to A in the transition probability set, P_{A1} represents the transition probability matrix corresponding to A_1 in the transition probability set, I denotes the identity matrix, R* represents the optimal reward value, and r represents the feedback quantity.
7. A computer device comprising a memory for storing at least one program and a processor for loading the at least one program to perform the method of any one of claims 1-5.
8. A storage medium having stored therein a processor-executable program, wherein the processor-executable program, when executed by a processor, is adapted to perform the method of any one of claims 1-5.
CN202111192042.6A 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium Active CN114047745B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111192042.6A CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111192042.6A CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Publications (2)

Publication Number Publication Date
CN114047745A CN114047745A (en) 2022-02-15
CN114047745B true CN114047745B (en) 2023-04-07

Family

ID=80204661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111192042.6A Active CN114047745B (en) 2021-10-13 2021-10-13 Robot motion control method, robot, computer device, and storage medium

Country Status (1)

Country Link
CN (1) CN114047745B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116449850B (en) * 2023-06-12 2023-09-15 南京泛美利机器人科技有限公司 Three-body cooperative transportation method and system based on behavioral cloning and cooperative coefficient
CN117001673B (en) * 2023-09-11 2024-06-04 中广核工程有限公司 Training method and device for robot control model and computer equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5330138B2 (en) * 2008-11-04 2013-10-30 本田技研工業株式会社 Reinforcement learning system
CN109540151B (en) * 2018-03-25 2020-01-17 哈尔滨工程大学 AUV three-dimensional path planning method based on reinforcement learning
CN109760046A (en) * 2018-12-27 2019-05-17 西北工业大学 Robot for space based on intensified learning captures Tum bling Target motion planning method
CN110794832B (en) * 2019-10-21 2021-11-09 同济大学 Mobile robot path planning method based on reinforcement learning
CN111179121B (en) * 2020-01-17 2023-03-21 华南理工大学 Power grid emergency control method based on expert system and deep reverse reinforcement learning
CN111367172B (en) * 2020-02-28 2021-09-21 华南理工大学 Hybrid system energy management strategy based on reverse deep reinforcement learning
CN111638646B (en) * 2020-05-29 2024-05-28 平安科技(深圳)有限公司 Training method and device for walking controller of quadruped robot, terminal and storage medium
CN112882469B (en) * 2021-01-14 2022-04-08 浙江大学 Deep reinforcement learning obstacle avoidance navigation method integrating global training
CN113495578B (en) * 2021-09-07 2021-12-10 南京航空航天大学 Digital twin training-based cluster track planning reinforcement learning method

Also Published As

Publication number Publication date
CN114047745A (en) 2022-02-15

Similar Documents

Publication Publication Date Title
JP7367233B2 (en) System and method for robust optimization of reinforcement learning based on trajectory-centered models
CN114047745B (en) Robot motion control method, robot, computer device, and storage medium
US11235461B2 (en) Controller and machine learning device
TWI743986B (en) Motor control device and motor control method
CN112045675B (en) Robot equipment controller, robot equipment system and control method thereof
JP7301034B2 (en) System and Method for Policy Optimization Using Quasi-Newton Trust Region Method
US11897066B2 (en) Simulation apparatus
CN112135717A (en) System and method for pixel-based model predictive control
Dani et al. Human-in-the-loop robot control for human-robot collaboration: Human intention estimation and safe trajectory tracking control for collaborative tasks
JP2022543926A (en) System and Design of Derivative-Free Model Learning for Robotic Systems
US11402808B2 (en) Configuring a system which interacts with an environment
US20190317472A1 (en) Controller and control method
JP2019185742A (en) Controller and control method
Caarls et al. Parallel online temporal difference learning for motor control
JP7180696B2 (en) Control device, control method and program
Oguz et al. Progressive stochastic motion planning for human-robot interaction
CN114529010A (en) Robot autonomous learning method, device, equipment and storage medium
CN114193443A (en) Apparatus and method for controlling robot apparatus
US11703871B2 (en) Method of controlling a vehicle and apparatus for controlling a vehicle
US20230241770A1 (en) Control device, control method and storage medium
Rottmann et al. Adaptive autonomous control using online value iteration with gaussian processes
JP6438354B2 (en) Self-position estimation apparatus and mobile body equipped with self-position estimation apparatus
CN110962120B (en) Network model training method and device, and mechanical arm motion control method and device
Le et al. Model-based Q-learning for humanoid robots
CN104345637B (en) Method and apparatus for adapting a data-based function model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant