CN108927806A - A kind of industrial robot learning method applied to high-volume repeatability processing - Google Patents
A kind of industrial robot learning method applied to high-volume repeatability processing
- Publication number: CN108927806A
- Application number: CN201810921161.2A
- Authority: CN (China)
- Prior art keywords: learning, unit, robot, information, network
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
Landscapes
- Engineering & Computer Science (AREA)
- Robotics (AREA)
- Mechanical Engineering (AREA)
- Feedback Control In General (AREA)
- Manipulator (AREA)
Abstract
The present invention provides an industrial robot learning method applied to high-volume repetitive processing, characterised in that the learning method learns based on a learning model and comprises the following steps: S001, a sensor collects state information; S002, learning is carried out according to the collected information; S003, it is judged whether the processing quality and the processing cycle meet the requirements; if so, learning ends, otherwise the state information is collected again and learning is repeated. The method of the invention learns and improves the control strategy according to the sensor data, achieves good control at high speed, simplifies robot debugging work, can be applied in high-volume, large-scale repetitive processing, solves the oscillation of the robot under high-speed operation caused by the lack of an accurate dynamic model in the traditional learning mode, and improves the working efficiency of the industrial robot.
Description
Technical Field
The invention relates to the technical field of industrial robots, in particular to an industrial robot learning method applied to large-batch repetitive processing.
Background
An industrial robot is a highly non-linear system, and accurate modeling of its dynamics is difficult to achieve. Earlier robots generally considered only kinematics and not a dynamic model. When only a kinematic model is used, on the one hand the maximum speed and acceleration at each point are usually set lower than what can actually be tolerated, so that the maximum torque of the actuators is never exceeded; this, however, leaves the performance of the actuators under-utilized. On the other hand, ignoring the dynamic characteristics harms the working efficiency of the industrial robot: under the influence of inertial force, centrifugal force, friction, gravity and joint torque, strong vibration is often generated during high-speed motion and heavy-load operation, which degrades the processing quality and shortens the service life of the robot. In addition, accurate dynamic modeling of an industrial robot suffers from the difficulty of identifying the robot parameters: if the consistency between robots is poor, the friction coefficients of the individual parts differ and the identified dynamic parameters are wrong; incorrect dynamic parameters make robot debugging more complicated, so that large-batch, large-scale application is difficult to realize.
Disclosure of Invention
Aiming at the defects or shortcomings in the prior art, the invention provides an industrial robot learning method applied to large-batch repetitive processing. The control strategy is learned and improved according to the sensor data, good control at high speed is achieved, the debugging work of the robot is simplified, application in large-batch, large-scale repetitive processing becomes possible, the problem of robot vibration under high-speed work caused by the lack of an accurate dynamics model in the traditional learning mode is solved, and the working efficiency of the industrial robot is improved.
In order to achieve the above object, the present invention provides a learning method for an industrial robot applied to large-batch repetitive processing; the learning method is based on a learning model and comprises the following steps:
S001, collecting state information by a sensor;
S002, learning according to the collected information;
S003, judging whether the processing quality and the processing period meet the requirements; if so, finishing learning, otherwise acquiring the state information again and re-learning.
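For illustration only, a minimal sketch of this three-step loop in Python, assuming hypothetical sensor, learner and quality-check interfaces (none of these names are specified by the patent), might look as follows:

```python
# Hypothetical sketch of the outer learning loop S001-S003.
def run_learning(sensor, learner, quality_ok, max_iterations=1_000_000):
    for _ in range(max_iterations):
        state = sensor.collect_state()     # S001: collect state information
        learner.learn(state)               # S002: learn from the collected information
        if quality_ok():                   # S003: processing quality and cycle meet requirements?
            break                          # requirements met, learning ends
    return learner
```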
As a further improvement of the invention, the learning model is composed of an environment unit, a robot learning unit and a processing execution unit;
the environment unit consists of a machined workpiece state measuring sensor and a robot state terminal measuring observer, wherein the machined workpiece state measuring sensor acquires visual information of a machined workpiece, and the visual information at least comprises the geometric shape and surface smoothness information of the workpiece; the robot state terminal measuring observer acquires information of the position, the speed, the acceleration and the joint torque of the robot;
the state observation unit acquires the information acquired by the environment unit through a communication line and converts the acquired information into a data format;
the data processing unit receives and processes the information converted into the data format by the state observation unit; the data processing unit comprises a reward calculation unit and a function updating unit, wherein the reward calculation unit sets an instant reward r through a reward function setting unit; the reward calculation unit performs the calculation on the information from the state observation unit and, after the calculation is completed, transmits the result parameters to the function updating unit; the function updating unit updates the acquired parameters by means of neural network training until the final learning parameters are obtained, stores the final learning parameters, makes a behavior decision through the neural network, and then performs deterministic-strategy reinforcement learning to drive the robot to work.
As a further improvement of the present invention, the reinforcement learning is defined by assuming that the robot is defined as a strategy pi from state information to behavior, and the cumulative reward obtained from the time t is defined as:based on accumulated reward
Obtaining an expected reward; wherein Q isπ(st,at) To representIn state s according to strategy pitTake action atExpected return on time;
combining the accumulated return and the formula of the expected return to obtain a recursive formula of the expected return:
the decision is made according to a recursive formal formula using the last updated strategy.
In the invention, reinforcement learning is adopted. Reinforcement learning strategies are divided into deterministic and non-deterministic strategies; the invention adopts the deterministic-strategy form of reinforcement learning, that is, in a given state a behavior is output directly rather than a probability, and the expected return Q can be calculated by formula (4): Q^μ(s_t, a_t) = E[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))], where μ represents a definite behavior.
As a further improvement of the present invention, the reinforcement learning adopts a reinforcement learning manner of a deterministic strategy, and the specific process thereof includes the following steps:
S201, initializing a behavior network μ(s|θ^μ), whose parameters are denoted θ^μ, and an evaluation network Q(s, a|θ^Q), whose parameters are denoted θ^Q, and initializing the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q and θ^μ′ ← θ^μ.
S202, initializing a buffer container R;
S203, receiving the state information s_t from the state observation unit;
S204, selecting the execution behavior a_t according to the current strategy and applying a certain noise;
S205, observing the obtained reward r_t and observing the next state information s_{t+1};
S206, storing the quadruple <s_t, a_t, r_t, s_{t+1}> in the buffer container R;
S207, randomly selecting a batch of quadruple samples from the buffer container for training;
S208, updating the evaluation network parameters;
S209, updating the behavior network parameters;
S210, judging whether the number of learning iterations exceeds a preset value or whether the processing quality is good enough;
S211, transmitting the parameters of the evaluation network and the behavior network to a host for storage, and finishing learning.
As a further improvement of the present invention, when the evaluation network parameters are updated in step S208, the target value y_t is first set as: y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1})|θ^Q), and the parameters are then calculated through the formula min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) − y_t)²] to update the evaluation network, where a_t represents the behavior at time t, Q represents the expected cumulative return, θ^Q represents the parameters of the evaluation network, E represents the expected value, over several groups of data, of the squared error between the actual return and the target, L(θ^Q) represents the error under the parameters θ^Q, and μ(s_{t+1}) represents the deterministic strategy in state s_{t+1}.
As a further improvement of the present invention, when the behavior network parameters are updated in step S209, the gradient method
∇_{θ^μ} J = E[ ∇_a Q(s, a|θ^Q)|_{a=μ(s_t)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_t} ]
is used to update the behavior network, and the target networks are updated by the following group of formulas:
θ′ ← τθ + (1−τ)θ′
θ^Q′ ← τθ^Q + (1−τ)θ^Q′
θ^μ′ ← τθ^μ + (1−τ)θ^μ′
with τ ≪ 0.05,
where ∇_{θ^μ} denotes the derivative with respect to θ^μ, ∇_a denotes the derivative with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, taking θ^μ as the variable.
The invention has the beneficial effects that:
1. The method of the invention collects the processing information and learns by means of reinforcement learning, which reduces the debugging work of the robot and optimizes the control strategy of the industrial robot, including the trajectory planning function under a given path and the motor control strategy under a given trajectory; it solves the problem of robot vibration under high-speed work caused by the lack of an accurate dynamic model in the traditional learning mode, and improves the working efficiency of the industrial robot.
2. The learning method learns the control strategy under high-speed work, and learns and improves the control strategy according to the sensor data, so as to achieve good control at high speed.
Drawings
FIG. 1 is a schematic diagram of the learning model structure of the present invention;
FIG. 2 is a flow chart of a learning method of the present invention;
FIG. 3 is a flow chart of reinforcement learning according to the present invention.
Detailed Description
The invention is further described with reference to the following description and embodiments in conjunction with the accompanying drawings.
The learning method is obtained based on a learning model structure, and the learning model structure is also an industrial robot system; FIG. 1 is a schematic diagram of a learning model structure according to the present invention; the model consists of an environment unit, a robot learning unit and a processing execution unit, wherein the environment unit at least comprises a processing quality measuring unit, the robot learning unit comprises a state observation unit, a data processing unit and a decision making unit, and the processing execution unit at least comprises a robot and a locator.
The working process of each unit of the learning model of the invention is as follows:
the environment unit, which is a processing quality measuring unit in this implementation, is composed of a processing workpiece state measuring sensor and a robot state end measuring observer, and the processing workpiece state measuring sensor mainly collects visual information of the processing workpiece, including the geometry and surface smoothness of the workpiece. The robot state end measurement observer can also be a robot state end measurement sensor and is used for acquiring information such as the position, the speed, the acceleration, the joint torque and the like of the robot.
And the state observation unit acquires the information acquired by the processing quality measurement unit through a communication line and converts the acquired information into a data format.
The data processing unit receives and processes the information converted into the data format by the state observation unit. The data processing unit comprises a reward calculation unit and a function updating unit. The reward calculation unit sets the instant reward r through a reward function setting unit; in this embodiment the reward function takes the form r = α*speed + β*error + γ*acceleration + u·R·uᵀ + …, and the weight of each index is changed by adjusting the parameters α, β and γ, where α represents the weight of speed in the reward function, β represents the weight of the position error, γ represents the weight of acceleration, R represents a positive definite matrix, and u represents quantities such as voltage and current. The reward calculation unit performs the calculation on the information from the state observation unit and sends the calculated result to the function updating unit for updating. In this embodiment the function updating unit preferably updates the parameters by means of neural network training to obtain the final learning parameters, then stores the learning result and uses the neural network to make the behavior decision that drives the robot to work; new information is then continuously collected. In this way reinforcement learning can adjust the voltage, position, speed, acceleration, etc. in different states according to the required performance indexes; this process includes the trajectory planning function and the motor control strategy.
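A minimal sketch of such a reward calculation in Python, assuming a dictionary-like state record and purely illustrative weights (the argument names and state keys are assumptions, not part of the patent), could be:

```python
import numpy as np

# Hypothetical sketch of the reward calculation unit; gamma_w is used for the
# acceleration weight to avoid confusion with the discount factor gamma.
def instant_reward(state, alpha=1.0, beta=1.0, gamma_w=1.0, R=None):
    u = np.asarray(state["u"], dtype=float)      # actuator quantities (voltage, current, ...)
    if R is None:
        R = np.eye(u.size)                       # positive definite weighting matrix
    return (alpha * state["speed"]
            + beta * state["error"]
            + gamma_w * state["acceleration"]
            + u @ R @ u)                         # quadratic term u*R*u^T
```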
In this embodiment, a learning method applicable to large-batch repetitive machining robots is designed based on the model structure shown in fig. 1; a flowchart of the method is shown in fig. 2, and the specific steps are as follows:
S001, collecting state information by a sensor;
S002, learning according to the collected information;
S003, judging whether the processing quality and the processing period meet the requirements; if so, finishing learning, otherwise acquiring the state information again and re-learning.
In this embodiment, in step S001 the sensors collect the state information. The robot state end-measuring observer in the processing information measuring unit acquires the position, speed, acceleration, current, voltage, vibration rate and torque information of the robot joints and of the end of the mechanical arm; the processing workpiece state measuring sensor mainly collects visual information containing the geometric shape and surface smoothness of the processed workpiece. In this process the visual information is converted to gray scale so that the influence of illumination is avoided, and information such as position, speed and acceleration is normalized and unified in length so that it can be fed into the neural network for processing.
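The kind of preprocessing described here might be sketched as follows; the gray-scale weights and the pad/truncate scheme are assumptions for illustration only:

```python
import numpy as np

# Hypothetical preprocessing for step S001: gray-scale conversion of the visual
# information and normalization of the numeric state vector to a fixed length.
def preprocess(rgb_image, numeric_state, expected_len):
    gray = np.asarray(rgb_image, dtype=float)[..., :3] @ [0.299, 0.587, 0.114]   # luminance gray scale
    x = np.asarray(numeric_state, dtype=float)
    x = (x - x.mean()) / (x.std() + 1e-8)                                         # zero mean, unit variance
    x = np.pad(x, (0, max(0, expected_len - x.size)))[:expected_len]              # unify the information length
    return gray, x
```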
In the present embodiment, in step S002 learning is performed based on the collected information; the machine learning unit performs deep reinforcement learning based on information such as current, voltage, moment, torque, vibration and stereoscopic vision obtained by the state observation unit.
In the process of reinforcement learning, the robot is made to interact with the environment on a discrete, equally spaced time sequence. Specifically, at each time step t the state observation unit inputs an observed state to the machine learning unit, the machine learning unit feeds this state into the neural network as input information according to the current state, and the output result is a definite behavior of the motor. The robot is defined as a strategy π from state information to behavior, and the cumulative return obtained from time t is:
R_t = Σ_{i=t}^{T} γ^{(i−t)} r(s_i, a_i)    (1)
the strategies obtained by reinforcement learning can be divided into a deterministic strategy and an uncertain strategy, the uncertain strategy outputs the probability of each behavior, and the deterministic strategy directly outputs a certain behavior.
In the invention, the purpose of using reinforcement learning is to learn a deterministic strategy pi, namely to directly learn a strategy from state input to output action, namely for a behavior network, the input is state information and the output is action. Maximizing the expected return Q from the initial state can be expressed by equation (2):
Q^π(s_t, a_t) = E_π[R_t | s_t, a_t]    (2)
wherein Q^π(s_t, a_t) represents the expected return when action a_t is taken in state s_t according to strategy π. In conjunction with equations (1) and (2), a recursive form of the expected return can be derived, as in equation (3):
Q^π(s_t, a_t) = E[r(s_t, a_t) + γ Q^π(s_{t+1}, π(s_{t+1}))]    (3)
this means that we can make decisions during the learning process continuously using the last updated strategy.
In the invention, reinforcement learning is adopted. Reinforcement learning strategies are divided into deterministic and non-deterministic strategies; the invention adopts the deterministic-strategy form of reinforcement learning, that is, in a given state a behavior is output directly rather than a probability, and the expected return Q can be calculated by formula (4):
Q^μ(s_t, a_t) = E[r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1}))]    (4)
where μ represents a definite behavior. In this process, equation (4) is a recursive form of equation (2): equation (2) is the general conceptual expression of the expected return in the invention, while equation (4) is the practical implementation, which allows the cumulative return at the current time to be calculated from the last cumulative return value and the instant reward value at that time, for ease of programming on a computer.
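As a quick illustration that the recursive form (4) reproduces the discounted cumulative return of equation (1), the following Python snippet evaluates both on one short trajectory with made-up reward values (γ and the rewards are illustrative assumptions):

```python
# Numerical check with illustrative values.
gamma = 0.9
rewards = [1.0, 2.0, 0.5, 3.0]                      # r(s_t, a_t) along one trajectory

# Equation (1): R_0 = sum_i gamma**i * r_i
R0_direct = sum(gamma**i * r for i, r in enumerate(rewards))

# Equation (4): Q_t = r_t + gamma * Q_{t+1}, evaluated backwards from the end
Q = 0.0
for r in reversed(rewards):
    Q = r + gamma * Q

assert abs(R0_direct - Q) < 1e-12                   # both give the same value
```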
In the present embodiment, a specific method of reinforcement learning with a deterministic strategy includes the steps shown in fig. 3:
S201, initializing a behavior network μ(s|θ^μ), whose parameters are denoted θ^μ, and an evaluation network Q(s, a|θ^Q), whose parameters are denoted θ^Q, and initializing the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q and θ^μ′ ← θ^μ.
Specifically, the evaluation neural network Q(s, a|θ^Q) and the behavior network μ(s|θ^μ) are initialized; their network parameters are denoted θ^Q and θ^μ respectively and represent the weights of the neurons in the networks. Here θ^Q represents the parameters of the evaluation network and θ^μ represents the parameters of the behavior network. The input of the behavior network μ(s|θ^μ) is the state information from the state observation unit and its output is a definite behavior a_t. The inputs of the evaluation network Q(s, a|θ^Q) are the state information from the state observation unit and the behavior output by the behavior network μ(s|θ^μ), and its output is the value of a cost function for taking this behavior in this state, which reflects how good the current strategy is. The target networks are then initialized; they have the same structure as the behavior network and the evaluation network respectively, their neural network parameters come from slowly updated copies of the behavior network and the evaluation network, and they are updated slowly compared with the behavior network and the evaluation network so as to maintain the stability of the neural network learning process.
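A minimal sketch of this initialization in Python/PyTorch (the framework, layer sizes and dimensions are assumptions for illustration; the patent does not prescribe a particular network structure):

```python
import copy
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # small multilayer perceptron used for both networks in this sketch
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

state_dim, action_dim = 24, 6                 # illustrative dimensions
actor = mlp(state_dim, action_dim)            # behavior network  mu(s | theta_mu)
critic = mlp(state_dim + action_dim, 1)       # evaluation network Q(s, a | theta_Q)
actor_target = copy.deepcopy(actor)           # theta_mu' <- theta_mu
critic_target = copy.deepcopy(critic)         # theta_Q'  <- theta_Q
```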
S202, initializing a buffer container;
S203, receiving the state information of the state observation unit;
S204, selecting an execution behavior according to the current strategy and applying a certain noise;
specifically, for each discrete time point of a processing period, a certain random noise is added to select the behavior a according to the current strategytNamely: a ist=μ(st|θμ) + N (t) where μ: (s)=argmaxaQ(s,a)。
S205, observing the obtained reward and observing the next state information;
Specifically, the current state information s_t is taken as input and fed into the behavior network, which outputs a specific behavior value a_t; the state information and the behavior value output by the behavior network are then input into the evaluation network, which outputs the reward r_t; the state observation unit collects the state information s_{t+1} of the next moment, giving the quadruple information (s_t, a_t, r_t, s_{t+1}).
S206, storing (s_t, a_t, r_t, s_{t+1}) in a buffer container; (s_t, a_t, r_t, s_{t+1}) represents the reward r_t obtained after taking behavior a_t in state s_t and the state at the next moment;
S207, randomly selecting a batch of samples from the buffer container for training;
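A minimal sketch of the buffer container R and the random sampling of steps S206-S207 (capacity and batch size are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.data = deque(maxlen=capacity)       # buffer container R

    def store(self, s, a, r, s_next):            # S206: store the quadruple
        self.data.append((s, a, r, s_next))

    def sample(self, batch_size=64):             # S207: random batch of quadruples
        return random.sample(self.data, batch_size)
```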
S208, updating the evaluation network parameters;
Specifically, the target value is set as: y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1})|θ^Q), and the evaluation network is updated through the formula min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) − y_t)²]; Q represents the expected cumulative return, θ^Q represents the parameters of the evaluation network, E represents the expected value, over several groups of data, of the squared error between the actual return and the target, L(θ^Q) represents the error under the parameters θ^Q, and μ(s_{t+1}) represents the deterministic strategy in state s_{t+1};
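Continuing the actor/critic objects from the sketch after step S201 and the ReplayBuffer above, the evaluation-network update of step S208 might be sketched as follows (the optimizer, learning rate and discount factor are assumptions):

```python
import numpy as np
import torch
import torch.nn.functional as F

gamma = 0.99                                                # illustrative discount factor
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def update_critic(batch):
    s, a, r, s_next = (torch.as_tensor(np.asarray(x), dtype=torch.float32)
                       for x in zip(*batch))
    with torch.no_grad():
        a_next = actor(s_next)                                                    # mu(s_{t+1})
        y = r + gamma * critic(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)   # target y_t
    q = critic(torch.cat([s, a], dim=-1)).squeeze(-1)                             # Q(s_t, a_t | theta_Q)
    loss = F.mse_loss(q, y)                                                       # L(theta_Q) = E[(Q - y_t)^2]
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```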
S209, updating the behavior network parameters;
Specifically, the gradient method
∇_{θ^μ} J = E[ ∇_a Q(s, a|θ^Q)|_{a=μ(s_t)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_t} ]
is used to update the behavior network, and the target networks are updated by the following group of formulas:
θ′ ← τθ + (1−τ)θ′
θ^Q′ ← τθ^Q + (1−τ)θ^Q′
θ^μ′ ← τθ^μ + (1−τ)θ^μ′
with τ ≪ 0.05,
wherein ∇_{θ^μ} denotes the derivative with respect to θ^μ, ∇_a denotes the derivative with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, taking θ^μ as the variable.
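Continuing the same sketch, the behavior-network update of step S209 and the soft update of the target networks could be written as follows (τ and the learning rate are illustrative; backpropagation through the critic realizes the product of ∇_a Q and ∇_{θ^μ} μ):

```python
import torch

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
tau = 0.001                                      # soft-update rate, consistent with tau << 0.05

def update_actor_and_targets(s):
    s = torch.as_tensor(s, dtype=torch.float32)
    # maximizing E[Q(s, mu(s))] is implemented as minimizing its negative
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    # theta' <- tau * theta + (1 - tau) * theta'
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```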
S210, judging whether the number of learning iterations exceeds a preset value or whether the processing quality is good enough; specifically, when the number of learning iterations reaches a predetermined number (for example, one million) or the learned strategy already meets the application requirements (the processing quality is good), learning is exited.
And S211, transmitting the parameters of the evaluation network and the behavior network to the host for storage, and ending.
After learning is finished, the parameters of the evaluation network and the behavior network are transmitted to a host for storage, and the (s_t, a_t, r_t, s_{t+1}) information obtained during processing is also transmitted to the host for storage. The host transmits the obtained information to other robots; after obtaining the trained neural network parameters, the robot learning unit adopts the same neural network structure, fixes these parameters so that they cannot be changed, and allows only the parameters of the last two layers to change; the parameters of the last two layers are then adjusted according to the actual processing conditions of the robot.
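A minimal sketch of this transfer step, continuing the actor network defined earlier (the choice of the last two Linear layers and the fine-tuning optimizer are assumptions for illustration):

```python
import torch

# Freeze the transferred parameters and leave only the last two weight layers trainable.
for p in actor.parameters():
    p.requires_grad = False
last_two = [m for m in actor.modules() if isinstance(m, torch.nn.Linear)][-2:]
for layer in last_two:
    for p in layer.parameters():
        p.requires_grad = True

# Fine-tune only the trainable layers according to the robot's actual processing conditions.
finetune_opt = torch.optim.Adam(
    (p for p in actor.parameters() if p.requires_grad), lr=1e-4)
```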
In this implementation, r represents the instant reward and R represents the cumulative return.
The learning method and reinforcement learning process of the invention collect the real-time state information of the industrial robot and take account of the influence of inertial force, centrifugal force, friction, gravity and joint torque on the work of the robot during operation (including high-speed motion and heavy-load processes), in which vibration is generated. Information such as the position, speed, acceleration and joint torque of the robot is collected through the robot state end-measuring sensor, and the optimal strategy is updated and generated in real time, which effectively ensures the consistency of the robots and avoids wrong decisions.
In conclusion, the method of the invention collects processing information and uses reinforcement learning, which reduces the debugging work of the robot and optimizes the control strategy of the industrial robot, including the trajectory planning function under a given path and the motor control strategy under a given trajectory; it solves the problem of robot vibration under high-speed work caused by the lack of an accurate dynamic model in the traditional learning mode, and improves the working efficiency of the industrial robot. The learning method is mainly used for learning the control strategy under high-speed work, and learns and improves the control strategy according to the sensor data so as to achieve good control at high speed.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.
Claims (6)
1. An industrial robot learning method applied to large-batch repetitive processing is characterized in that: the learning method is based on a learning model for learning, and comprises the following steps:
S001, collecting state information by a sensor;
S002, learning according to the collected information;
S003, judging whether the processing quality and the processing period meet the requirements; if so, finishing learning, otherwise acquiring the state information again and re-learning.
2. The industrial robot learning method applied to the large-batch repetitive processing according to claim 1, characterized in that: the learning model consists of an environment unit, a robot learning unit and a processing execution unit; the robot learning unit comprises a state observation unit, a data processing unit and a decision making unit, and the processing execution unit at least comprises a robot and a locator;
the environment unit consists of a machined workpiece state measuring sensor and a robot state terminal measuring observer, wherein the machined workpiece state measuring sensor acquires visual information of a machined workpiece, and the visual information at least comprises the geometric shape and surface smoothness information of the workpiece; the robot state terminal measuring observer acquires information of the position, the speed, the acceleration and the joint torque of the robot;
the state observation unit acquires the information acquired by the environment unit through a communication line and converts the acquired information into a data format;
the data processing unit receives and processes the information converted into the data format by the state observation unit; the data processing unit comprises a reward calculation unit and a function updating unit, wherein the reward calculation unit sets an instant reward r through a reward function setting unit; the reward calculation unit performs the calculation on the information from the state observation unit and, after the calculation is completed, transmits the result parameters to the function updating unit; the function updating unit updates the acquired parameters by means of neural network training until the final learning parameters are obtained, stores the final learning parameters, makes a behavior decision through the neural network, and then performs deterministic-strategy reinforcement learning to drive the robot to work.
3. The industrial robot learning method applied to the large-batch repetitive processing according to claim 2, characterized in that: the reinforcement learning is defined as follows: the robot is defined as a strategy π from state information to behavior, and the cumulative return obtained from time t is defined as R_t = Σ_{i=t}^{T} γ^{(i−t)} r(s_i, a_i); based on the cumulative return, the expected return is obtained by Q^π(s_t, a_t) = E_π[R_t | s_t, a_t], wherein Q^π(s_t, a_t) represents the expected return when action a_t is taken in state s_t according to strategy π; combining the cumulative return and the expected return yields the recursive formula of the expected return: Q^π(s_t, a_t) = E[r(s_t, a_t) + γ Q^π(s_{t+1}, π(s_{t+1}))];
the decision is made according to the recursive formula using the last updated strategy.
4. The industrial robot learning method applied to the large-batch repetitive processing according to claim 2, characterized in that: the reinforcement learning adopts a reinforcement learning mode of a deterministic strategy, and the specific process comprises the following steps:
S201, initializing a behavior network μ(s|θ^μ), whose parameters are denoted θ^μ, and an evaluation network Q(s, a|θ^Q), whose parameters are denoted θ^Q, and initializing the target networks Q′(s, a|θ^Q′) and μ′(s|θ^μ′) with parameters θ^Q′ ← θ^Q and θ^μ′ ← θ^μ;
S202, initializing a buffer container R;
S203, receiving the state information s_t from the state observation unit;
S204, selecting the execution behavior a_t according to the current strategy and applying a certain noise;
S205, observing the obtained reward r_t and observing the next state information s_{t+1};
S206, storing the quadruple <s_t, a_t, r_t, s_{t+1}> in the buffer container;
S207, randomly selecting a batch of quadruple samples from the buffer container for training;
S208, updating the evaluation network parameters;
S209, updating the behavior network parameters;
S210, judging whether the number of learning iterations exceeds a preset value or whether the processing quality is good enough;
S211, transmitting the parameters of the evaluation network and the behavior network to a host for storage, and finishing learning.
5. The industrial robot learning method applied to the large-batch repetitive processing according to claim 3, characterized in that: when the evaluation network parameters are updated in step S208, the target value y_t is first set as: y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1})|θ^Q), and the parameters are then calculated through the formula min_{θ^Q} L(θ^Q) = E[(Q(s_t, a_t|θ^Q) − y_t)²] to update the evaluation network, wherein a_t represents the behavior at time t, Q represents the expected cumulative return, θ^Q represents the parameters of the evaluation network, E represents the expected value, over several groups of data, of the squared error between the actual return and the target, L(θ^Q) represents the error under the parameters θ^Q, and μ(s_{t+1}) represents the deterministic strategy in state s_{t+1}.
6. The industrial robot learning method applied to the large-batch repetitive processing according to claim 3, characterized in that: when the behavior network parameters are updated in step S209, the gradient method
∇_{θ^μ} J = E[ ∇_a Q(s, a|θ^Q)|_{a=μ(s_t)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_t} ]
is used to update the behavior network, and the target networks are updated by the following group of formulas:
θ′ ← τθ + (1−τ)θ′
θ^Q′ ← τθ^Q + (1−τ)θ^Q′
θ^μ′ ← τθ^μ + (1−τ)θ^μ′
with τ ≪ 0.05,
wherein ∇_{θ^μ} denotes the derivative with respect to θ^μ, ∇_a denotes the derivative with respect to a, and ∇_{θ^μ}J denotes the derivative of J with respect to θ^μ, taking θ^μ as the variable.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810921161.2A CN108927806A (en) | 2018-08-13 | 2018-08-13 | A kind of industrial robot learning method applied to high-volume repeatability processing |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810921161.2A CN108927806A (en) | 2018-08-13 | 2018-08-13 | A kind of industrial robot learning method applied to high-volume repeatability processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108927806A true CN108927806A (en) | 2018-12-04 |
Family
ID=64445042
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810921161.2A Pending CN108927806A (en) | 2018-08-13 | 2018-08-13 | A kind of industrial robot learning method applied to high-volume repeatability processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108927806A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110293560A (en) * | 2019-01-12 | 2019-10-01 | 鲁班嫡系机器人(深圳)有限公司 | Robot behavior training, planing method, device, system, storage medium and equipment |
CN114630734A (en) * | 2019-09-30 | 2022-06-14 | 西门子股份公司 | Visual servoing with dedicated hardware acceleration to support machine learning |
CN114925988A (en) * | 2022-04-29 | 2022-08-19 | 南京航空航天大学 | Machining task driven multi-robot collaborative planning method |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106392266A (en) * | 2015-07-31 | 2017-02-15 | 发那科株式会社 | Machine learning device, arc welding control device, and arc welding robot system |
JP2017102613A (en) * | 2015-11-30 | 2017-06-08 | ファナック株式会社 | Machine learning device and method for optimizing smoothness of feeding of feed shaft of machine and motor control device having machine learning device |
CN107199397A (en) * | 2016-03-17 | 2017-09-26 | 发那科株式会社 | Machine learning device, laser-processing system and machine learning method |
US20180079076A1 (en) * | 2016-09-16 | 2018-03-22 | Fanuc Corporation | Machine learning device, robot system, and machine learning method for learning operation program of robot |
CN108052004A (en) * | 2017-12-06 | 2018-05-18 | 湖北工业大学 | Industrial machinery arm autocontrol method based on depth enhancing study |
CN108202327A (en) * | 2016-12-16 | 2018-06-26 | 发那科株式会社 | Machine learning device, robot system and machine learning method |
Non-Patent Citations (1)
Title |
---|
Liu Quan, Zhai Jianwei, Zhang Zongchang, Zhong Shan, Zhou Qian, Zhang Peng, Xu Jin: "A Survey of Deep Reinforcement Learning" (深度强化学习综述), Chinese Journal of Computers (《计算机学报》) * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP6810087B2 (en) | Machine learning device, robot control device and robot vision system using machine learning device, and machine learning method | |
CN107102644B (en) | Underwater robot track control method and control system based on deep reinforcement learning | |
JP6616170B2 (en) | Machine learning device, laminated core manufacturing apparatus, laminated core manufacturing system, and machine learning method for learning stacking operation of core sheet | |
JP6219897B2 (en) | Machine tools that generate optimal acceleration / deceleration | |
CN108927806A (en) | A kind of industrial robot learning method applied to high-volume repeatability processing | |
CN111618862B (en) | Robot operation skill learning system and method under guidance of priori knowledge | |
JP6077617B1 (en) | Machine tools that generate optimal speed distribution | |
CN116460860B (en) | Model-based robot offline reinforcement learning control method | |
CN115812180A (en) | Robot-controlled offline learning using reward prediction model | |
CN113043275B (en) | Micro-part assembling method based on expert demonstration and reinforcement learning | |
JP6457382B2 (en) | Machine learning device, industrial machine system, manufacturing system, machine learning method and machine learning program for learning cash lock | |
CN111783994A (en) | Training method and device for reinforcement learning | |
CN112571420B (en) | Dual-function model prediction control method under unknown parameters | |
CN113614743A (en) | Method and apparatus for operating a robot | |
CN116494247A (en) | Mechanical arm path planning method and system based on depth deterministic strategy gradient | |
JP2021501433A (en) | Generation of control system for target system | |
Leyendecker et al. | Deep Reinforcement Learning for Robotic Control in High-Dexterity Assembly Tasks—A Reward Curriculum Approach | |
CN115416024A (en) | Moment-controlled mechanical arm autonomous trajectory planning method and system | |
CN114415507B (en) | Deep neural network-based smart hand-held process dynamics model building and training method | |
CN117787384A (en) | Reinforced learning model training method for unmanned aerial vehicle air combat decision | |
CN111914361B (en) | Wind turbine blade rapid design optimization method based on reinforcement learning | |
CN114186498A (en) | Robot joint friction model parameter identification method based on improved wolf algorithm | |
Kaur et al. | Learning robotic skills through reinforcement learning | |
CN113096153A (en) | Real-time active vision method based on deep reinforcement learning humanoid football robot | |
CN117444978B (en) | Position control method, system and equipment for pneumatic soft robot |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181204 |