WO2023044878A1 - Motion control method and device - Google Patents

Motion control method and device

Info

Publication number
WO2023044878A1
WO2023044878A1 (PCT/CN2021/120801, CN2021120801W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
reinforcement learning
motion control
value
learning model
Prior art date
Application number
PCT/CN2021/120801
Other languages
English (en)
French (fr)
Inventor
王子健
范顺杰
Original Assignee
西门子股份公司
西门子(中国)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 西门子股份公司, 西门子(中国)有限公司 filed Critical 西门子股份公司
Priority to PCT/CN2021/120801 priority Critical patent/WO2023044878A1/zh
Priority to CN202180101498.9A priority patent/CN117813561A/zh
Publication of WO2023044878A1 publication Critical patent/WO2023044878A1/zh

Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators

Definitions

  • the present invention mainly relates to the field of motion control, in particular to a motion control method and device.
  • Motion control optimization is crucial to improving the performance of product production lines.
  • the speed control optimization of servo motors and the optimization of synchronous multi-axis position control can significantly improve the performance of product production lines.
  • Motion control optimization is usually achieved by selecting model parameters.
  • the dynamic model of the controlled object (such as a driver) is designed by experienced domain experts. Selecting an appropriate model and optimizing the model parameters results in better motion control performance, but this manual optimization process is time-consuming, labor-intensive, and inefficient.
  • reinforcement learning is introduced to learn the optimal parameters in the motion control model.
  • This type of reinforcement learning can realize automatic optimization.
  • however, modeling the controlled object requires deep domain knowledge, and the achievable performance improvement is limited by the motion model itself.
  • the reinforcement learning model can only be applied to the current controlled object and cannot be reused in other controlled objects.
  • the present invention provides a motion control method and device, and improves the efficiency of reinforcement learning model modeling in motion control.
  • the present invention proposes a motion control method, the motion control method including: determining the motion control model of the controlled object and training an online reinforcement learning model according to the motion control model, wherein the motion control model outputs a model control value and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model; calculating a reward using the model control value and the feedback value; and the online reinforcement learning model generating a residual control value according to the reward, the model control value, and the feedback value, and controlling the motion of the controlled object according to the residual control value and the model control value. In this way, the online reinforcement learning model is trained on the basis of the motion control model and does not need to be trained from scratch, which improves the training efficiency of the online reinforcement learning model.
  • the motion control method includes: sending the motion control model, model control value, feedback value, and reward to the cloud; training an offline reinforcement learning model according to the motion control model, model control value, feedback value, and reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  • the data collected during the online reinforcement learning of the controlled object are uploaded to the cloud, classified, and used to train an offline reinforcement learning model, which can be deployed in motion control systems of the same kinematics type, improving the versatility of the reinforcement learning model in motion control.
  • before updating the online reinforcement learning model with the offline reinforcement learning model, the method includes: acquiring the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model. In this way, checking the consistency between the kinematics type of the controlled object and that of the offline reinforcement learning model makes the update or deployment better targeted.
  • the model control value includes an axis position control value
  • the feedback value includes an axis position feedback value
  • calculating the reward using the model control value and the feedback value includes: calculating the axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  • the reward is calculated via the control and feedback values of the axis position.
  • determining the motion control model of the controlled object includes: receiving a kinematics type selected by the user and model parameters input by the user, under which the controlled object starts. In this way, the user only needs to roughly select the kinematics type and input model parameters, without optimizing the parameters, which lowers the modeling requirements on the user and improves the automation and intelligence of motion control.
  • after the motion control model, model control value, feedback value, and reward are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model.
  • the training of the offline reinforcement learning model is thereby realized.
  • the present invention also proposes a motion control device, the motion control device including: a determination module, which determines the motion control model of the controlled object and trains an online reinforcement learning model according to the motion control model, wherein the motion control model outputs a model control value and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model; a reward calculation module, which calculates a reward using the model control value and the feedback value; and a control module, in which the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and controls the motion of the controlled object according to the residual control value and the model control value.
  • a determination module, which determines the motion control model of the controlled object and trains an online reinforcement learning model according to the motion control model, wherein the motion control model outputs a model control value and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model
  • a reward calculation module uses the model control value and the feedback value to calculate rewards
  • a control module
  • the motion control device includes: sending the motion control model, model control value, feedback value, and reward to the cloud; training an offline reinforcement learning model according to the motion control model, model control value, feedback value, and reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  • before updating the online reinforcement learning model with the offline reinforcement learning model, the device acquires the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
  • the model control value includes an axis position control value
  • the feedback value includes an axis position feedback value
  • calculating the reward using the model control value and the feedback value includes: calculating the axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  • determining the motion control model of the controlled object includes: receiving a kinematics type selected by the user and model parameters input by the user, under which the controlled object starts.
  • after the motion control model, model control value, feedback value, and reward are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model.
  • the present invention also proposes an electronic device, including a processor, a memory, and instructions stored in the memory, wherein the instructions implement the method as described above when executed by the processor.
  • the present invention also proposes a computer-readable storage medium on which computer instructions are stored, and the computer instructions, when executed, perform the method described above.
  • Fig. 1 is a flowchart of a motion control method according to an embodiment of the present invention;
  • Fig. 2 is a schematic diagram of a motion control method according to an embodiment of the present invention;
  • Fig. 3 is a schematic diagram of a motion control device according to an embodiment of the present invention;
  • Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
  • Fig. 1 is a flowchart of a motion control method according to an embodiment of the present invention. As shown in Figure 1, the motion control method 100 includes:
  • Step 110: determine the motion control model of the controlled object and train an online reinforcement learning model according to the motion control model; the motion control model outputs a model control value, and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model.
  • determining the motion control model of the controlled object includes: receiving a kinematics type selected by the user and model parameters input by the user, and the controlled object starts under that kinematics type and those model parameters. In this way, the user only needs to adjust the model parameters until the controlled object can start; the user does not need to optimize the parameters, which significantly reduces the workload and improves the efficiency of motion control.
  • the kinematics type can be designed or selected by the user based on the kinematics or application requirements of the controlled object (such as a single-axis drive or a synchronized multi-axis drive).
  • proportional-integral-derivative (PID) control is selected for a single-axis drive, for example, or Cartesian position control is selected for a synchronized multi-axis drive. Thus, by selecting the kinematics type of the motion control model and initially inputting the parameters of the motion control model, the motion control model of the controlled object is determined, and the motion control model outputs a model control value U_m.
  • an online reinforcement learning model is trained according to the motion control model; the online reinforcement learning model outputs an initial control value U_a0 at the initial moment, the controlled object moves based on the model control value U_m and the initial control value U_a0, and a feedback value is generated during the motion. The feedback value can be an axis position value, an axis velocity value, an axis torque value, and the like.
  • Fig. 2 is a schematic diagram of a motion control method according to an embodiment of the present invention.
  • Fig. 2 shows a plurality of motion control systems A, B, C, and each motion control system includes a control device and a controlled object.
  • the control device includes an edge device 220 and a controller 210, and the controlled object is a driver 240, which can drive a motor to rotate.
  • the control device in the embodiment of the present invention is not limited thereto, and the control device may also be in a single hardware form.
  • the edge device 220 can be an industrial computer (IPC), and the controller 210 can be a programmable logic controller (PLC).
  • the control device can be an industrial computer with a virtual PLC configured inside, or a PLC with a computing module integrated inside.
  • the controller 210 includes a motion control model 211. The user can select the kinematics type of the motion control model 211, such as PID control or Cartesian position control, and input model parameters, thereby determining the motion control model 211. The motion control model 211 outputs a model control value U_m, and an online reinforcement learning model is trained according to the motion control model.
  • the online reinforcement learning model outputs an initial control value U_a0 at the initial moment, and the controlled object moves based on the model control value U_m and the initial control value U_a0 and generates a feedback value during the motion.
  • Step 120: calculate the reward using the model control value and the feedback value.
  • rewards can be calculated using model control values and feedback values, such as axis position following error, axis velocity following error, axis torque following error, Cartesian position following error, and Cartesian velocity following error, etc.
  • the model control value includes an axis position control value
  • the feedback value includes an axis position feedback value
  • calculating the reward using the model control value and the feedback value includes: calculating the axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  • for a single-axis drive, the reward can be calculated by the formula r = 1/|err_pos|, where r is the reward and err_pos is the axis position following error, obtained by subtracting the axis position feedback value from the model axis position control value.
  • for a synchronized multi-axis drive, the reward can be calculated by the formula r = 1/||err_x + err_y + err_z||, where r is the reward and err_x, err_y, and err_z are the Cartesian position errors in the X, Y, and Z directions, respectively.
  • the data acquisition module 221 collects the model control value output by the motion control model 211 and the feedback value generated by the driver 240, and sends the model control value and the feedback value to the reward calculation module 222; the reward calculation module 222 calculates the reward from the model control value and the feedback value and sends the reward to the reinforcement learning model 223.
  • Step 130: the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
  • the SARSA algorithm (state-action-reward-state-action) can be used to train the online reinforcement learning model.
  • the trained online reinforcement learning model generates a residual control value U_a according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value U_a and the model control value U_m.
  • the reinforcement learning model 223 receives the reward sent by the reward calculation module 222 and the model control value and feedback value sent by the data acquisition module, generates a residual control value U_a according to the reward, the model control value, and the feedback value, and sends the residual control value to the transceiver 212 in the controller 210; the transceiver 212 sends the residual control value U_a and the model control value U_m to the driver 240; the controlled object moves based on the residual control value U_a and the model control value U_m and continuously generates feedback values during the motion; the feedback values continue to be sent to the data acquisition module 221, and the preceding process is repeated iteratively until the following error is eliminated, at which point the expected control is achieved.
  • the motion control method may include sending the data collected during the online reinforcement learning of multiple control systems (including motion control models, model control values, feedback values, rewards, etc.) to the cloud, where a general offline reinforcement learning model is trained according to the motion control model, model control values, feedback values, and rewards.
  • the relevant data used to train the offline reinforcement learning model needs to be collected from the controlled object and its control system with the same kinematics type.
  • before updating the online reinforcement learning model with the offline reinforcement learning model, the method includes: obtaining the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the offline reinforcement learning model is used to update the original online reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
  • after the motion control model, model control values, feedback values, and rewards are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model.
  • for example, the CQL algorithm (Conservative Q-Learning algorithm) can be used to train the offline reinforcement learning model.
  • the cloud 230 comprises a training data processing module 231, training data sets 232, and an offline reinforcement learning model 233.
  • the data acquisition module 221 sends the motion control model, model control values, feedback values, and rewards to the training data processing module 231 in the cloud 230.
  • the training data processing module 231 classifies the motion control model, model control values, feedback values, and rewards into multiple training data sets 232 according to the kinematics type, and uses the multiple training data sets 232 with the CQL algorithm (Conservative Q-Learning algorithm) to train the offline reinforcement learning model 233, thereby realizing the training of the offline reinforcement learning model 233; the trained offline reinforcement learning model 233 is used to update the online reinforcement learning model of motion control system A, or is deployed to motion control systems B and C, which do not have an online reinforcement learning model, thereby improving the versatility of reinforcement learning models in motion control.
  • the embodiment of the present invention provides a motion control method.
  • the online reinforcement learning model is trained based on the motion control model, without training from scratch, and the training efficiency of the online reinforcement learning model is improved.
  • the data collected during the online reinforcement learning of the controlled object are uploaded to the cloud, classified, and used to train an offline reinforcement learning model, which can be deployed in motion control systems of the same kinematics type, improving the versatility of the reinforcement learning model in motion control.
  • FIG. 3 is a schematic diagram of a motion control device 300 according to an embodiment of the present invention. As shown in FIG. 3 , the motion control device 300 includes:
  • Determination module 310: determines the motion control model of the controlled object and trains an online reinforcement learning model according to the motion control model; the motion control model outputs a model control value, and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model;
  • Reward calculation module 320: calculates the reward using the model control value and the feedback value;
  • Control module 330: the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
  • the motion control device 300 includes: sending the motion control model, model control value, feedback value, and reward to the cloud; training an offline reinforcement learning model according to the motion control model, model control value, feedback value, and reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  • before updating the online reinforcement learning model with the offline reinforcement learning model, the device obtains the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
  • the model control value includes an axis position control value
  • the feedback value includes an axis position feedback value
  • calculating the reward using the model control value and the feedback value includes: calculating the axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  • determining the motion control model of the controlled object includes: receiving a motion control model selected by the user and model parameters input by the user, under which the controlled object starts.
  • after the motion control model, model control value, feedback value, and reward are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model.
  • FIG. 4 is a schematic diagram of an electronic device 400 according to an embodiment of the present invention.
  • the electronic device 400 includes a processor 410 and a memory 420 , and the memory 420 stores instructions, wherein the instructions are executed by the processor 410 to implement the method 100 as described above.
  • the present invention also proposes a computer-readable storage medium on which computer instructions are stored, and when executed, the computer instructions execute the method 100 as described above.
  • Some aspects of the method and apparatus of the present invention may be entirely implemented by hardware, may be entirely implemented by software (including firmware, resident software, microcode, etc.), or may be implemented by a combination of hardware and software.
  • the above hardware or software may be referred to as “block”, “module”, “engine”, “unit”, “component” or “system”.
  • the processor can be one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof.
  • aspects of the present invention may be embodied as a computer product comprising computer readable program code on one or more computer readable media.
  • computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic tape, ...), optical disks (e.g., compact disk (CD), digital versatile disk (DVD), ...), smart cards, and flash memory devices (e.g., cards, sticks, key drives, ...).

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

A motion control method (100), comprising: determining a motion control model (211) of a controlled object and training an online reinforcement learning model (223) according to the motion control model (211), the motion control model (211) outputting a model control value, and the controlled object generating a feedback value according to the model control value and an initial control value output by the online reinforcement learning model (223) (110); calculating a reward using the model control value and the feedback value (120); and the online reinforcement learning model (223) generating a residual control value according to the reward, the model control value, and the feedback value, and controlling the motion of the controlled object according to the residual control value and the model control value (130).

Description

Motion control method and device
Technical field
The present invention mainly relates to the field of motion control, and in particular to a motion control method and device.
Background art
Motion control optimization is crucial to improving the performance of product production lines; for example, optimizing the speed control of servo motors and the synchronized position control of multiple axes can significantly improve the performance of product production lines.
Motion control optimization is usually achieved by selecting model parameters. The dynamic model of the controlled object (for example, a driver) is designed by experienced domain experts, and selecting an appropriate model and optimizing the model parameters yields better motion control performance, but this manual optimization process is time-consuming, labor-intensive, and inefficient.
To overcome the drawbacks of manual optimization, reinforcement learning has been introduced to learn the optimal parameters in the motion control model. This kind of reinforcement learning can realize automatic optimization; however, modeling the controlled object requires deep domain knowledge, the achievable performance improvement is limited by the motion model itself, and, in addition, the reinforcement learning model is only applicable to the current controlled object and cannot be reused for other controlled objects.
Summary of the invention
To solve the above technical problems, the present invention provides a motion control method and device, and improves the efficiency of reinforcement learning model modeling in motion control.
To achieve the above object, the present invention proposes a motion control method, the motion control method comprising: determining a motion control model of a controlled object and training an online reinforcement learning model according to the motion control model, the motion control model outputting a model control value, and the controlled object generating a feedback value according to the model control value and an initial control value output by the online reinforcement learning model; calculating a reward using the model control value and the feedback value; and the online reinforcement learning model generating a residual control value according to the reward, the model control value, and the feedback value, and controlling the motion of the controlled object according to the residual control value and the model control value. In this way, the online reinforcement learning model is trained on the basis of the motion control model and does not need to be trained from scratch, which improves the training efficiency of the online reinforcement learning model.
Preferably, the motion control method comprises: sending the motion control model, the model control value, the feedback value, and the reward to the cloud; training an offline reinforcement learning model according to the motion control model, the model control value, the feedback value, and the reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model. In this way, the data collected during the online reinforcement learning of the controlled object are uploaded to the cloud, classified, and used to train an offline reinforcement learning model, which can be deployed in motion control systems of the same kinematics type, improving the versatility of the reinforcement learning model in motion control.
Preferably, before the online reinforcement learning model is updated with the offline reinforcement learning model, the method comprises: acquiring the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model. In this way, by checking the consistency between the kinematics type of the controlled object and that of the offline reinforcement learning model, the update or deployment can be better targeted.
Preferably, the model control value comprises an axis position control value and the feedback value comprises an axis position feedback value, and calculating the reward using the model control value and the feedback value comprises: calculating an axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error. In this way, the reward is calculated from the control value and the feedback value of the axis position.
Preferably, determining the motion control model of the controlled object comprises: receiving a kinematics type selected by the user and model parameters input by the user, under which the controlled object starts. In this way, the user only needs to roughly select the kinematics type and input model parameters, without optimizing the parameters, which lowers the modeling requirements on the user and improves the automation and intelligence of motion control.
Preferably, after the motion control model, the model control value, the feedback value, and the reward are sent to the cloud, they are classified into a plurality of training data sets, and the plurality of training data sets are used to train the offline reinforcement learning model. In this way, the training of the offline reinforcement learning model is realized.
The present invention also proposes a motion control device, the motion control device comprising: a determination module, which determines a motion control model of a controlled object and trains an online reinforcement learning model according to the motion control model, the motion control model outputting a model control value, and the controlled object generating a feedback value according to the model control value and an initial control value output by the online reinforcement learning model; a reward calculation module, which calculates a reward using the model control value and the feedback value; and a control module, in which the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
Preferably, the motion control device comprises: sending the motion control model, the model control value, the feedback value, and the reward to the cloud; training an offline reinforcement learning model according to the motion control model, the model control value, the feedback value, and the reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
Preferably, before the online reinforcement learning model is updated with the offline reinforcement learning model, the kinematics type of the controlled object is acquired; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
Preferably, the model control value comprises an axis position control value and the feedback value comprises an axis position feedback value, and calculating the reward using the model control value and the feedback value comprises: calculating an axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
Preferably, determining the motion control model of the controlled object comprises: receiving a kinematics type selected by the user and model parameters input by the user, under which the controlled object starts.
Preferably, after the motion control model, the model control value, the feedback value, and the reward are sent to the cloud, they are classified into a plurality of training data sets, and the plurality of training data sets are used to train the offline reinforcement learning model.
The present invention also proposes an electronic device, comprising a processor, a memory, and instructions stored in the memory, wherein the instructions, when executed by the processor, implement the method described above.
The present invention also proposes a computer-readable storage medium on which computer instructions are stored, the computer instructions, when executed, performing the method described above.
Brief description of the drawings
The following drawings are only intended to schematically illustrate and explain the present invention and do not limit its scope, in which:
Fig. 1 is a flowchart of a motion control method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a motion control method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of a motion control device according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Description of reference numerals
100 motion control method
110-130 steps
210 controller
211 motion control model
212 transceiver
220 edge device
221 data acquisition module
222 reward calculation module
223 online reinforcement learning model
230 cloud
231 training data processing module
232 training data set
233 offline reinforcement learning model
300 motion control device
310 determination module
320 reward calculation module
330 control module
400 electronic device
410 processor
420 memory
Detailed description of the embodiments
In order to give a clearer understanding of the technical features, objects, and effects of the present invention, specific embodiments of the present invention are now described with reference to the accompanying drawings.
Many specific details are set forth in the following description to facilitate a full understanding of the present invention, but the present invention can also be implemented in other ways different from those described here, so the present invention is not limited by the specific embodiments disclosed below.
As used in this application and the claims, unless the context clearly indicates otherwise, the words "a", "an", "one", and/or "the" do not specifically denote the singular and may also include the plural. In general, the terms "include" and "comprise" only indicate that the explicitly identified steps and elements are included; these steps and elements do not constitute an exclusive list, and a method or device may also include other steps or elements.
Fig. 1 is a flowchart of a motion control method according to an embodiment of the present invention. As shown in Fig. 1, the motion control method 100 includes:
Step 110: determine a motion control model of the controlled object and train an online reinforcement learning model according to the motion control model; the motion control model outputs a model control value, and the controlled object generates a feedback value according to the model control value and an initial control value output by the online reinforcement learning model.
In some embodiments, determining the motion control model of the controlled object includes: receiving a kinematics type selected by the user and model parameters input by the user, and the controlled object starts under that kinematics type and those model parameters. In this way, the user only needs to adjust the model parameters until the controlled object can start; the user does not need to perform parameter optimization, which significantly reduces the workload and improves the efficiency of motion control. The kinematics type can be designed or selected by the user based on the kinematics or application requirements of the controlled object (for example, a single-axis drive or a synchronized multi-axis drive). For example, proportional-integral-derivative (PID) control is selected for a single-axis drive, or Cartesian position control is selected for a synchronized multi-axis drive. By selecting the kinematics type of the motion control model and initially inputting its parameters, the motion control model of the controlled object is determined, and the motion control model outputs a model control value U_m. In addition, an online reinforcement learning model is trained according to the motion control model; the online reinforcement learning model outputs an initial control value U_a0 at the initial moment, and the controlled object moves based on the model control value U_m and the initial control value U_a0 and generates a feedback value during the motion. The feedback value can be an axis position value, an axis velocity value, an axis torque value, and the like.
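By way of illustration only, the following Python sketch shows how a user-parameterized motion control model of this kind could produce the model control value U_m for a single-axis drive using discrete-time PID control. The class name, the gains kp, ki, and kd, and the sampling time dt are hypothetical placeholders for the model parameters a user would enter; they are not values or code from this disclosure.

```python
# Illustrative sketch of a user-parameterized motion control model (single-axis PID).
# The gains and sampling time below are placeholder values, not values from this disclosure.
class PIDPositionModel:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def control(self, setpoint, feedback):
        """Return the model control value U_m for one control cycle."""
        error = setpoint - feedback
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# A roughly tuned model only needs to be good enough to start the controlled object;
# the online reinforcement learning model then refines the behavior with a residual.
model = PIDPositionModel(kp=2.0, ki=0.1, kd=0.05, dt=0.001)
u_m = model.control(setpoint=10.0, feedback=9.6)
```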
Fig. 2 is a schematic diagram of a motion control method according to an embodiment of the present invention. Fig. 2 shows a plurality of motion control systems A, B, and C, each of which includes a control device and a controlled object. Taking motion control system A as an example, the control device includes an edge device 220 and a controller 210, and the controlled object is a driver 240, which can drive a motor to rotate. It can be understood that the control device in the embodiments of the present invention is not limited to this, and the control device may also take a single hardware form. The edge device 220 may be an industrial computer (IPC) and the controller 210 may be a programmable logic controller (PLC); in other forms, the control device may be an industrial computer with a virtual PLC configured inside, or a PLC with a computing module integrated inside. As shown in Fig. 2, the controller 210 includes a motion control model 211. The user can select the kinematics type of the motion control model 211, for example PID control or Cartesian position control, and input model parameters, thereby determining the motion control model 211. The motion control model 211 outputs a model control value U_m, and an online reinforcement learning model is trained according to the motion control model; the online reinforcement learning model outputs an initial control value U_a0 at the initial moment, and the controlled object moves based on the model control value U_m and the initial control value U_a0 and generates a feedback value during the motion.
Step 120: calculate the reward using the model control value and the feedback value.
Depending on the motion control model and the drive's kinematics type, the reward can be calculated from the model control value and the feedback value, for example from the axis position following error, the axis velocity following error, the axis torque following error, the Cartesian position following error, or the Cartesian velocity following error. In some embodiments, the model control value includes an axis position control value and the feedback value includes an axis position feedback value, and calculating the reward using the model control value and the feedback value includes: calculating an axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
For example, for a single-axis drive, the reward can be calculated by the following formula:
r = 1/|err_pos|
where r is the reward and err_pos is the axis position following error, which can be obtained by subtracting the axis position feedback value from the model axis position control value.
As another example, for a synchronized multi-axis drive, the reward can be calculated by the following formula:
r = 1/||err_x + err_y + err_z||
where r is the reward and err_x, err_y, and err_z are the Cartesian position errors in the X, Y, and Z directions, respectively.
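As an illustrative sketch, the two reward formulas above can be computed directly from the control and feedback values as follows. The small guard value EPS, used to avoid division by zero when the following error vanishes, is an assumption of this sketch and is not stated in the disclosure.

```python
import numpy as np

EPS = 1e-9  # assumed guard against division by zero; not part of the original formulas

def single_axis_reward(position_control_value, position_feedback_value):
    """r = 1/|err_pos|, with err_pos = model axis position control value - axis position feedback value."""
    err_pos = position_control_value - position_feedback_value
    return 1.0 / (abs(err_pos) + EPS)

def multi_axis_reward(cartesian_control, cartesian_feedback):
    """r = 1/||err_x + err_y + err_z||, read literally as the magnitude of the summed Cartesian errors."""
    err = np.asarray(cartesian_control) - np.asarray(cartesian_feedback)  # [err_x, err_y, err_z]
    return 1.0 / (abs(err.sum()) + EPS)

# Example: an axis commanded to 10.0 that actually reached 9.98 gets a large reward,
# because the following error is small.
print(single_axis_reward(10.0, 9.98))
print(multi_axis_reward([1.0, 2.0, 3.0], [0.95, 2.05, 3.0]))
```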
As shown in Fig. 2, the data acquisition module 221 collects the model control value output by the motion control model 211 and the feedback value generated by the driver 240, and sends the model control value and the feedback value to the reward calculation module 222. The reward calculation module 222 calculates the reward from the model control value and the feedback value and sends the reward to the reinforcement learning model 223.
Step 130: the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
The SARSA (state-action-reward-state-action) algorithm can be used to train the online reinforcement learning model. The trained online reinforcement learning model generates a residual control value U_a according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value U_a and the model control value U_m.
As shown in Fig. 2, the reinforcement learning model 223 receives the reward sent by the reward calculation module 222 and the model control value and feedback value sent by the data acquisition module, generates a residual control value U_a according to the reward, the model control value, and the feedback value, and sends the residual control value to the transceiver 212 in the controller 210. The transceiver 212 sends the residual control value U_a and the model control value U_m to the driver 240. The controlled object moves based on the residual control value U_a and the model control value U_m and continuously generates feedback values during the motion; the feedback values continue to be sent to the data acquisition module 221, and the preceding process is repeated iteratively until the following error is eliminated, at which point the expected control is achieved.
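The iteration described above can be made concrete with the simplified sketch below, in which a SARSA-style update learns a residual control value U_a over a discretized set of residual actions added to the model control value U_m. The discretization, the hyperparameters, and the plant object standing in for the driver 240 interface are assumptions made for illustration and are not prescribed by this disclosure; model is assumed to be any object exposing a control(setpoint, feedback) method, such as the PID sketch above.

```python
import numpy as np

# Discretized residual actions U_a the agent may add to the model output U_m
# (an assumed simplification; the disclosure does not prescribe a discretization).
RESIDUALS = np.linspace(-1.0, 1.0, 21)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1      # assumed SARSA hyperparameters
BINS = np.linspace(-5.0, 5.0, 51)          # assumed following-error bins for the state

def discretize_state(err_pos):
    """Map the axis position following error to a discrete state index."""
    return int(np.digitize(err_pos, BINS))

q = np.zeros((len(BINS) + 2, len(RESIDUALS)))   # tabular Q(state, residual action)

def choose_action(state):
    """Epsilon-greedy selection over the residual actions."""
    if np.random.rand() < EPSILON:
        return np.random.randint(len(RESIDUALS))
    return int(np.argmax(q[state]))

def control_loop(model, plant, setpoint, steps=1000):
    """One episode of residual reinforcement learning around the motion control model."""
    feedback = plant.read_position()
    state = discretize_state(setpoint - feedback)
    action = choose_action(state)
    for _ in range(steps):
        u_m = model.control(setpoint, feedback)       # model control value U_m
        u_a = RESIDUALS[action]                       # residual control value U_a
        plant.apply(u_m + u_a)                        # send U_m + U_a to the driver
        feedback = plant.read_position()              # feedback value
        err_pos = setpoint - feedback
        reward = 1.0 / (abs(err_pos) + 1e-9)          # reward from the following error
        next_state = discretize_state(err_pos)
        next_action = choose_action(next_state)
        # SARSA update: Q(s, a) += alpha * (r + gamma * Q(s', a') - Q(s, a))
        q[state, action] += ALPHA * (reward + GAMMA * q[next_state, next_action]
                                     - q[state, action])
        state, action = next_state, next_action
```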
In some embodiments, to improve the generality of the control model, the motion control method may include sending the data collected during the online reinforcement learning of multiple control systems (including the motion control models, model control values, feedback values, rewards, and so on) to the cloud, and training a general offline reinforcement learning model according to the motion control model, the model control values, the feedback values, and the rewards. The data used to train the offline reinforcement learning model need to be collected from controlled objects and control systems with the same kinematics type.
In some embodiments, before the online reinforcement learning model is updated with the offline reinforcement learning model, the method includes: acquiring the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
In some embodiments, after the motion control model, the model control values, the feedback values, and the rewards are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model. For example, the CQL (Conservative Q-Learning) algorithm can be used to train the offline reinforcement learning model.
Continuing with Fig. 2, a cloud 230 is also included, which comprises a training data processing module 231, training data sets 232, and an offline reinforcement learning model 233. The data acquisition module 221 sends the motion control model, the model control values, the feedback values, and the rewards to the training data processing module 231 in the cloud 230. The training data processing module 231 classifies the motion control model, the model control values, the feedback values, and the rewards into multiple training data sets 232 according to the kinematics type, and uses the multiple training data sets 232 to train the offline reinforcement learning model 233 with the CQL (Conservative Q-Learning) algorithm, thereby realizing the training of the offline reinforcement learning model 233. The trained offline reinforcement learning model 233 is used to update the online reinforcement learning model of motion control system A, or is deployed to motion control systems B and C, which do not have an online reinforcement learning model, thereby improving the versatility of the reinforcement learning model in motion control. To make the deployment or update better targeted, before the deployment or update, the kinematics type of the controlled object is acquired; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model 233, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
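A rough sketch of this cloud-side workflow is given below: uploaded transitions are grouped into one training data set per kinematics type, a simplified discrete-action conservative Q-learning (CQL) update is fit to each group, and an offline model is handed out only to systems whose kinematics type matches. The record layout, the hyperparameters, and the tabular simplification of CQL are assumptions made for illustration; a practical system would more likely rely on an established offline reinforcement learning implementation.

```python
import numpy as np
from collections import defaultdict

# Each record uploaded from an edge device is assumed to be
# (kinematics_type, state, action, reward, next_state), with states and actions
# already discretized as in the online sketch above.
N_STATES, N_ACTIONS = 53, 21
ALPHA_CQL, LR, GAMMA = 1.0, 0.05, 0.9      # assumed hyperparameters

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def train_offline_models(records, epochs=50):
    """Group transitions by kinematics type and fit one conservative Q-table per type."""
    datasets = defaultdict(list)
    for kin_type, s, a, r, s_next in records:
        datasets[kin_type].append((s, a, r, s_next))   # one training data set per kinematics type

    models = {}
    for kin_type, data in datasets.items():
        q = np.zeros((N_STATES, N_ACTIONS))
        for _ in range(epochs):
            for s, a, r, s_next in data:
                # Conservative term: push down Q-values of actions not in the data set
                # and push up the logged action (simplified discrete-action CQL).
                conservative_grad = softmax(q[s])
                conservative_grad[a] -= 1.0
                q[s] -= LR * ALPHA_CQL * conservative_grad
                # Standard Bellman (Q-learning) error on the logged transition.
                td_error = r + GAMMA * q[s_next].max() - q[s, a]
                q[s, a] += LR * td_error
        models[kin_type] = q
    return models

def deploy(models, controlled_object_kin_type):
    """Update or deploy an offline model only when the kinematics types match."""
    return models.get(controlled_object_kin_type)      # None if there is no matching model
```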
The embodiments of the present invention provide a motion control method in which the online reinforcement learning model is trained on the basis of the motion control model, without training from scratch, which improves the training efficiency of the online reinforcement learning model. In addition, the data collected during the online reinforcement learning of the controlled object are uploaded to the cloud, classified, and used to train an offline reinforcement learning model, which can be deployed in motion control systems of the same kinematics type, improving the versatility of the reinforcement learning model in motion control.
The present invention also proposes a motion control device. Fig. 3 is a schematic diagram of a motion control device 300 according to an embodiment of the present invention. As shown in Fig. 3, the motion control device 300 includes:
a determination module 310, which determines the motion control model of the controlled object and trains an online reinforcement learning model according to the motion control model, where the motion control model outputs a model control value and the controlled object generates a feedback value according to the model control value and the initial control value output by the online reinforcement learning model;
a reward calculation module 320, which calculates the reward using the model control value and the feedback value;
a control module 330, in which the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
In some embodiments, the motion control device 300 includes: sending the motion control model, the model control value, the feedback value, and the reward to the cloud; training an offline reinforcement learning model according to the motion control model, the model control value, the feedback value, and the reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
In some embodiments, before the online reinforcement learning model is updated with the offline reinforcement learning model, the kinematics type of the controlled object is acquired; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
In some embodiments, the model control value includes an axis position control value and the feedback value includes an axis position feedback value, and calculating the reward using the model control value and the feedback value includes: calculating the axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
In some embodiments, determining the motion control model of the controlled object includes: receiving a motion control model selected by the user and model parameters input by the user, under which the controlled object starts.
In some embodiments, after the motion control model, the model control value, the feedback value, and the reward are sent to the cloud, they are classified into multiple training data sets, and the multiple training data sets are used to train the offline reinforcement learning model.
The present invention also proposes an electronic device 400. Fig. 4 is a schematic diagram of an electronic device 400 according to an embodiment of the present invention. As shown in Fig. 4, the electronic device 400 includes a processor 410 and a memory 420; the memory 420 stores instructions, and the instructions, when executed by the processor 410, implement the method 100 described above.
The present invention also proposes a computer-readable storage medium on which computer instructions are stored; the computer instructions, when executed, perform the method 100 described above.
Some aspects of the method and device of the present invention may be implemented entirely by hardware, entirely by software (including firmware, resident software, microcode, etc.), or by a combination of hardware and software. The above hardware or software may be referred to as a "block", "module", "engine", "unit", "component", or "system". The processor may be one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, or a combination thereof. In addition, aspects of the present invention may be embodied as a computer product located on one or more computer-readable media, the product comprising computer-readable program code. For example, computer-readable media may include, but are not limited to, magnetic storage devices (e.g., hard disk, floppy disk, magnetic tape, ...), optical disks (e.g., compact disk (CD), digital versatile disk (DVD), ...), smart cards, and flash memory devices (e.g., cards, sticks, key drives, ...).
Flowcharts are used herein to illustrate the operations performed by the methods according to the embodiments of the present application. It should be understood that the preceding operations are not necessarily performed exactly in order; instead, the various steps may be processed in reverse order or simultaneously, other operations may be added to these processes, or one or several steps may be removed from them.
It should be understood that although this specification is described in terms of various embodiments, it is not the case that each embodiment contains only one independent technical solution. This manner of presentation is adopted only for clarity; those skilled in the art should regard the specification as a whole, and the technical solutions in the various embodiments may also be appropriately combined to form other implementations that can be understood by those skilled in the art.
The above descriptions are only schematic specific embodiments of the present invention and are not intended to limit its scope. Any equivalent changes, modifications, and combinations made by those skilled in the art without departing from the concept and principles of the present invention shall fall within the protection scope of the present invention.

Claims (14)

  1. A motion control method (100), characterized in that the motion control method (100) comprises:
    determining a motion control model of a controlled object and training an online reinforcement learning model according to the motion control model, wherein the motion control model outputs a model control value, and the controlled object generates a feedback value according to the model control value and an initial control value output by the online reinforcement learning model (110);
    calculating a reward using the model control value and the feedback value (120);
    the online reinforcement learning model generating a residual control value according to the reward, the model control value, and the feedback value, and controlling the motion of the controlled object according to the residual control value and the model control value (130).
  2. The motion control method (100) according to claim 1, characterized in that the motion control method (100) comprises: sending the motion control model, the model control value, the feedback value, and the reward to the cloud; training an offline reinforcement learning model according to the motion control model, the model control value, the feedback value, and the reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  3. The motion control method (100) according to claim 2, characterized in that, before the online reinforcement learning model is updated with the offline reinforcement learning model, the method comprises: acquiring the kinematics type of the controlled object; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  4. The motion control method (100) according to claim 1, characterized in that the model control value comprises an axis position control value and the feedback value comprises an axis position feedback value, and calculating the reward using the model control value and the feedback value comprises: calculating an axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  5. The motion control method (100) according to claim 1, characterized in that determining the motion control model of the controlled object comprises: receiving a kinematics type selected by a user and model parameters input by the user, under which kinematics type and model parameters the controlled object starts.
  6. The motion control method (100) according to claim 2, characterized in that, after the motion control model, the model control value, the feedback value, and the reward are sent to the cloud, the motion control model, the model control value, the feedback value, and the reward are classified into a plurality of training data sets, and the plurality of training data sets are used to train the offline reinforcement learning model.
  7. A motion control device (300), characterized in that the motion control device (300) comprises:
    a determination module (310), which determines a motion control model of a controlled object and trains an online reinforcement learning model according to the motion control model, wherein the motion control model outputs a model control value, and the controlled object generates a feedback value according to the model control value and an initial control value output by the online reinforcement learning model;
    a reward calculation module (320), which calculates a reward using the model control value and the feedback value;
    a control module (330), in which the online reinforcement learning model generates a residual control value according to the reward, the model control value, and the feedback value, and the motion of the controlled object is controlled according to the residual control value and the model control value.
  8. The motion control device (300) according to claim 7, characterized in that the motion control device (300) comprises: sending the motion control model, the model control value, the feedback value, and the reward to the cloud; training an offline reinforcement learning model according to the motion control model, the model control value, the feedback value, and the reward; and updating the original online reinforcement learning model with the offline reinforcement learning model, or deploying the offline reinforcement learning model to a motion control system that does not have an online reinforcement learning model.
  9. The motion control device (300) according to claim 8, characterized in that, before the online reinforcement learning model is updated with the offline reinforcement learning model, the kinematics type of the controlled object is acquired; when the kinematics type of the controlled object is consistent with the kinematics type of the offline reinforcement learning model, the original online reinforcement learning model is updated with the offline reinforcement learning model, or the offline reinforcement learning model is deployed to a motion control system that does not have an online reinforcement learning model.
  10. The motion control device (300) according to claim 7, characterized in that the model control value comprises an axis position control value and the feedback value comprises an axis position feedback value, and calculating the reward using the model control value and the feedback value comprises: calculating an axis position following error according to the axis position control value and the axis position feedback value, and calculating the reward according to the axis position following error.
  11. The motion control device (300) according to claim 7, characterized in that determining the motion control model of the controlled object comprises: receiving a kinematics type selected by a user and model parameters input by the user, under which kinematics type and model parameters the controlled object starts.
  12. The motion control device (300) according to claim 8, characterized in that, after the motion control model, the model control value, the feedback value, and the reward are sent to the cloud, the motion control model, the model control value, the feedback value, and the reward are classified into a plurality of training data sets, and the plurality of training data sets are used to train the offline reinforcement learning model.
  13. An electronic device (400), comprising a processor (410), a memory (420), and instructions stored in the memory (420), wherein the instructions, when executed by the processor (410), implement the method according to any one of claims 1-6.
  14. A computer-readable storage medium on which computer instructions are stored, wherein the computer instructions, when executed, perform the method according to any one of claims 1-6.
PCT/CN2021/120801 2021-09-26 2021-09-26 Motion control method and device WO2023044878A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/120801 WO2023044878A1 (zh) 2021-09-26 2021-09-26 Motion control method and device
CN202180101498.9A CN117813561A (zh) 2021-09-26 2021-09-26 Motion control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/120801 WO2023044878A1 (zh) 2021-09-26 2021-09-26 Motion control method and device

Publications (1)

Publication Number Publication Date
WO2023044878A1 true WO2023044878A1 (zh) 2023-03-30

Family

ID=85719905

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/120801 WO2023044878A1 (zh) 2021-09-26 2021-09-26 Motion control method and device

Country Status (2)

Country Link
CN (1) CN117813561A (zh)
WO (1) WO2023044878A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005199383A (ja) * 2004-01-15 2005-07-28 Sony Corp 動的制御装置および動的制御装置を用いた2足歩行移動体
US20160196765A1 (en) * 2014-12-24 2016-07-07 NeuroSpire, Inc. System and method for attention training using electroencephalography (EEG) based neurofeedback and motion-based feedback
CN109240280A (zh) * 2018-07-05 2019-01-18 上海交通大学 基于强化学习的锚泊辅助动力定位系统控制方法
CN109352648A (zh) * 2018-10-12 2019-02-19 北京地平线机器人技术研发有限公司 机械机构的控制方法、装置和电子设备
CN109581874A (zh) * 2018-12-29 2019-04-05 百度在线网络技术(北京)有限公司 用于生成信息的方法和装置
CN112698572A (zh) * 2020-12-22 2021-04-23 西安交通大学 一种基于强化学习的结构振动控制方法、介质及设备

Also Published As

Publication number Publication date
CN117813561A (zh) 2024-04-02

Similar Documents

Publication Publication Date Title
JP6774637B2 (ja) 制御装置及び制御方法
JP6499720B2 (ja) 機械学習装置、サーボ制御装置、サーボ制御システム、及び機械学習方法
CN109274314B (zh) 机器学习装置、伺服电动机控制装置、伺服电动机控制系统以及机器学习方法
JP4813618B2 (ja) イナーシャと摩擦を同時に推定する機能を有する電動機の制御装置
JP6669715B2 (ja) 振動抑制装置
JP2019166626A (ja) 制御装置及び機械学習装置
JP2013184245A (ja) ロボット制御装置、ロボット装置、ロボット制御方法、プログラム及び記録媒体
CN104647387A (zh) 机器人控制方法、系统和装置
JP2017102617A (ja) 補正装置、補正装置の制御方法、情報処理プログラム、および記録媒体
JP6564432B2 (ja) 機械学習装置、制御システム、制御装置、及び機械学習方法
JP6748135B2 (ja) 機械学習装置、サーボ制御装置、サーボ制御システム、及び機械学習方法
TWI733738B (zh) 結合多致動器的比例積分微分控制方法與系統
US11087509B2 (en) Output device, control device, and evaluation function value output method
JP6386516B2 (ja) 学習機能を備えたロボット装置
WO2021192023A1 (ja) 数値制御装置
JP2015051469A (ja) ロボット制御装置、ロボット装置、ロボット制御方法、プログラム及び記録媒体
JP2019061523A (ja) 情報処理装置、情報処理方法およびプログラム
WO2023044878A1 (zh) 运动控制方法及装置
JP2006281330A (ja) ロボットシミュレーション装置
JP2020035159A (ja) パラメータ調整装置
JP2014117787A (ja) 制御装置
CN113852321B (zh) 指令生成装置、指令生成方法
JP7179672B2 (ja) 計算機システム及び機械学習方法
WO2022074734A1 (ja) 制御システムの生産装置、制御システムの生産方法及びプログラム
WO2019167374A1 (ja) 情報処理装置および情報処理方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21957996

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180101498.9

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE