CN114667852A

CN114667852A - Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning

Info

Publication number: CN114667852A
Application number: CN202210248923.3A
Authority: CN
Inventors: 蒙艳玫; 李科; 缪祥烜; 韦锦; 韩冰; 武豪
Original assignee: Guangxi University
Current assignee: Guangxi University
Priority date: 2022-03-14
Filing date: 2022-03-14
Publication date: 2022-06-28
Anticipated expiration: 2042-03-14
Also published as: CN114667852B

Abstract

The invention discloses an intelligent cooperative control method of a hedge trimming robot based on deep reinforcement learning, which comprises the following steps: establishing an MDP deep reinforcement learning model of the hedge trimming robot; building a deep neural network framework; designing a strategy network objective function and a value function network objective function of an improved PPO algorithm; training a deep neural network by adopting an improved PPO algorithm according to a maximization strategy network target reward function and a minimization function network target function mean square error principle; an Adam adaptive gradient algorithm with an improved adaptive learning rate is adopted to optimize an objective function, an optimal strategy of a training model of the hedgerow trimming robot is obtained through repeated updating iteration, the optimal action can be predicted and output through inputting latest state data, and control instructions of the mobile chassis and the trimming mechanical arm are output. According to the method, the hedge trimming robot is not required to be physically modeled, the control error caused by inaccurate model is avoided, the algorithm is prevented from falling into a local optimal solution, the updating efficiency of the algorithm is accelerated, and meanwhile the generalization capability of the control algorithm is improved.

Description

Hedge trimming robot intelligent cooperative control method based on deep reinforcement learning

Technical Field

The invention relates to the technical field of control, in particular to a hedge trimming robot intelligent cooperative control method based on deep reinforcement learning and used for highway hedge trimming.

Background

With the gradual increase of the trimming task amount of the expressway green isolation belt, in view of low manual trimming efficiency, automatic equipment such as automatic green maintenance vehicles, vehicle-mounted hedge trimming vehicles, unmanned trimming robots and the like are applied in the industry. The unmanned pruning robot mainly comprises an unmanned chassis and a pruning manipulator, and the unmanned chassis and the manipulator need to be cooperatively controlled in real time during pruning operation, so that the robot can be close to and accurately position a hedgerow by adjusting the posture of the robot and prune the hedgerow.

At present, some research results have been found in the field in a cooperative control method for a mobile manipulator, for example:

chinese patent application No. CN109176525A discloses a mobile manipulator adaptive control method based on Radial Basis Function (RBF) neural network. The method comprises the steps of performing dynamic modeling on the mobile manipulator, building an RBF neural network of a robot dynamic model, designing a mobile manipulator trajectory tracking method by using the neural network, performing online compensation and identification on unknown dynamic parameters to further realize cooperative control of the mobile platform and the manipulator, and improving the dynamic performance of the mobile manipulator and the trajectory tracking precision of joint space.

Chinese patent application publication No. CN201510269642.6 discloses a method for controlling a mobile manipulator based on GPS and binocular vision positioning, where the mobile manipulator moves to a position near a target object according to position information of the target object obtained by the GPS, the binocular vision obtains three-dimensional information of the object, an end effector of the mobile manipulator approaches the target object quickly based on vision feed-forward, the manipulator is controlled to center the object and control the end effector to grasp the object based on vision feedback, and the vision feed-forward and feedback control enables the mobile manipulator to approach the target quickly and grasp the target after accurate positioning, thereby improving control efficiency.

The existing various mathematical analysis methods (inverse Jacobian matrix, model prediction control and the like) based on the system model or a plurality of methods have the defects of difficult system model establishment, inaccurate model, complex and tedious calculation, low system response speed, large position error accumulation, poor control system self-adaptive capacity, poor anti-jamming capability, poor robustness and the like. However, the existing solutions based on intelligent algorithms (such as neural networks and reinforcement learning) have the defects of incomplete data information, unclear and incomplete data sources, inaccurate environment models, long system response time, low algorithm learning efficiency and the like, and cannot fully ensure the positioning accuracy, the smoothness of joint operation, the response speed, the safety and other requirements of the robot system.

Disclosure of Invention

The invention aims to provide an intelligent cooperative control method of a hedge trimming robot based on deep reinforcement learning, which omits the complicated process of manual modeling calculation in the traditional method, can update a control strategy in real time, improves the dynamic response characteristic of a system, accelerates the update efficiency of an algorithm, avoids falling into a local optimal solution, and simultaneously improves the generalization capability of the control algorithm.

The hedgerow trimming robot comprises a movable chassis and a trimming mechanical arm fixed on the movable chassis, wherein a vision detection module is installed on the hedgerow trimming robot; the vision detection module comprises a hedge cross section detection camera arranged at the tail end of the trimming mechanical arm, a hedge height and distance detection camera arranged at the base of the trimming mechanical arm, and a front side lane line detection camera arranged on the moving chassis;

the invention provides an intelligent cooperative control method of a hedge trimming robot based on deep reinforcement learning, which comprises the following steps:

establishing a Markov Decision (MDP) deep reinforcement learning model of a hedge trimming robot;

step two, building a deep neural network frame;

Thirdly, designing a strategy network objective function and a value function network objective function of the improved PPO algorithm;

step four, training a deep neural network by adopting an improved PPO algorithm according to a maximum strategy network target reward function and a minimum value function network target function mean square error principle;

and fifthly, optimizing a target function by adopting an Adam adaptive gradient algorithm for improving the adaptive learning rate, obtaining an optimal strategy of the hedge trimming robot training model through repeated updating iteration, predicting and outputting optimal actions by inputting latest state data, and outputting control instructions of the mobile chassis and the trimming mechanical arm.

Preferably, in the MDP deep reinforcement learning model of the hedge trimming robot established in step one: the Markov Decision (MDP) process of the hedge trimming robot is described by a quintuple (S, A, P, R, gamma), wherein S represents a state set, A represents an action set, P represents a state transition probability, the value is 0 to 1, R is a reward function, gamma is a reward discount factor, the value is 0 to 1, and the MDP process is used for calculating the accumulated reward obtained in the interaction process of the intelligent agent and the environment; the intelligent agent is a vehicle-arm cooperative control module of the hedge trimming robot, and the environment comprises the hedge trimming robot, a hedge and a lane line; receiving the state S of the current environment moment by a strategy model of the hedge trimming robot _tSelecting and performing action A_tThen with a probability P (S) according to the environment model_t+1|S_t，A_t) Enter a new state S_t+1And earn a prize R_t+1Policy model re-acceptance State S_t+1And R_t+1Continuously generating and executing a control instruction of the hedge trimming robot, continuously optimizing and adjusting the strategy model according to the maximum reward obtained in the process until a certain condition is met, and finishing interaction between the intelligent agent and the environment;

wherein the state set comprises the following eight parts: position P of center of hedge cross section in pixel coordinate system of hedge cross section detection camera_C＝(X_c，Y_c) The height H of the hedge, the distance L from the center of the longitudinal section of the hedge to the origin of coordinates of the trimming mechanical arm, and the position of the lane line under the coordinate system of the front lane line detection cameraAnd the position of the hedgerow in the vehicle coordinate system

Pose of each joint of trimming mechanical arm

Position of hedgerow under coordinate system of trimming mechanical arm end effector

Pose of mobile chassis under world coordinate system

Wherein, the position P of the center of the cross section of the hedge under the pixel coordinate system of the detection camera of the cross section of the hedge_C＝(X_c，Y_c) The acquisition method comprises the following steps: the hedgerow image is shot through the hedgerow cross section detection camera and feature extraction is carried out, a feature template of the hedgerow cross section shape is obtained, when the tail end of the trimming mechanical arm moves right above the hedgerow, the hedgerow cross section shape is detected and matched with the feature template, after the matching coincidence degree is larger than 80%, the hedgerow cross section feature recognition can be considered to be effective, and a cross section center coordinate value P is output _C＝(X_c，Y_c)；

The method for acquiring the height H of the hedgerow and the distance L from the center of the longitudinal tangent plane of the hedgerow to the coordinate origin of the trimming mechanical arm comprises the following steps: shooting a hedge image through a hedge height and distance detection camera, and acquiring a hedge height H and a distance L from the center of a longitudinal section of the hedge to the coordinate origin of the trimming mechanical arm;

the method for acquiring the position of the lane line under the coordinate system of the front lane line detection camera comprises the following steps: the front lane line detection camera takes a picture of a front lane, and obtains a left front (x) line and a right front (x) line of a lane line in a front ROI area from the picture_flf，y_flf) Left rear (x)_flb，y_flb)、Front right (x)_frf，y_frf) Right rear (x)_frb，y_frb) The coordinate values of (2).

Wherein the action set comprises the following three parts: motion velocity component of each joint of trimming mechanical arm

Longitudinal speed of moving chassis, front wheel slip angle

Wherein the reward function comprises the following parts: the operation smoothness of each joint of the trimming mechanical arm is rewarded, the distance between a hedge and an end actuator of the trimming mechanical arm is rewarded, and the tracking precision of the movable chassis to the lane line is rewarded;

wherein, the reward of running smoothness of each joint of the trimming mechanical arm is represented as:

wherein i is 0,1,2, 3; d omega_ikThe angular velocity differential of the servo motor of each joint of the trimming mechanical arm, which is output by the system, represents the running stability of the motor, a ₁，b₁Is constant and a₁<0，b₁>0；

Wherein the distance reward between the hedge and the end effector of the trimming robot is expressed as:

r_dist＝a2(d_dist)²+c

in the formula a₂C is a constant and a₂<0，c>0，r_distIs the distance between the hedge and the coordinate origin of the trimming mechanical arm end effector,

when the end effector of the trimming mechanical arm is closer to the hedgerow, the user can learnThe greater the value of the reward that can be obtained;

wherein, the reward of the tracking precision of the mobile chassis to the lane line is represented as:

in the formula, a₃，b₂Is constant and a₃<0，b₂>0，K₁，K₂The number of the intelligent agent is constant, the boundary of the coordinate range of the lane center line is represented, x represents the value of the lane center line, and the reward value of the intelligent agent is larger when the value of x is closer to the center line and is smaller otherwise;

in summary, the reward function is expressed as:

R＝ω₁*r_arms+ω₂*r_dist+ω₃*r_track

in the formula, ω₁，ω₂，ω₃The values of the weights of the pruning robot system on the reward functions of all parts are 0 to 1, and the larger the weight is, the better the performance of the trained prediction model in the aspect is.

Preferably, in the second step, a deep neural network framework of the strategy network and a deep neural network framework of the value function network are respectively built by adopting a fully-connected neural network model.

Preferably, the policy network objective function of the improved PPO algorithm in step three is expressed as:

respectively representing the original strategy and the updated strategy in a single round; n is a radical of _kFor a set of trajectories obtained after executing a policy in an environment, N_k＝{τ_iDenoted by τ is a set of state-action sequences S₀，A₀，R₀…S_t，A_t，R_t}，

Representing the ratio between the old strategy and the new strategy; the epsilon is a hyperparametric truncation factor,

to use the merit function of generalized merit estimation (GAE), defined as

In the formula V (S)_t) In order to estimate the state cost function, lambda is an introduced hyper-parameter and takes the value from 0 to 1; e is the same as₀The initial truncation factor is N, the training frequency is already finished, and the set total training frequency is represented by N;

the value function network objective function of the improved PPO algorithm in step three is represented as:

in the formula, V_φ(S_t) Is a state cost function.

Preferably, the learning rate in the optimizer of Adam adaptive gradient algorithm described in step five is represented as:

wherein, alpha is the initial step length, beta₁，β₂An exponential decay rate is estimated for a moment, e 0, 1, which is a small constant with a stable value, usually 10^-8，m_t，v_tThe biased first moment estimate and the biased second moment estimate, respectively, are calculated from the objective function gradient, and Down _ bdy _ a and Up _ bdy _ a, respectively, represent the lower and upper bounds of the learning rate, and are expressed as:

in the formula, N is the current training cycle number, N is the preset total training cycle number, and in the process of optimizing the objective function, the lower bound of the learning rate is set as the learning rate value when the objective function of the strategy network starts to rise or the learning rate value when the value function network starts to fall; the upper bound of the learning rate is set as a learning rate value when the objective function of the strategy network starts to fall or a learning rate value when the value function network starts to rise; by adaptively truncating the learning rate in a specified interval, keeping the upper bound of the learning rate unchanged in the early stage of training, continuously improving the lower bound of the learning rate, keeping the lower bound of the learning rate unchanged in the later stage of training, and continuously reducing the lower bound of the learning rate, a relatively large learning rate can be obtained in the early stage of training to enable the target function to jump out of a local optimal solution, and the learning rate is monotonically reduced in the later stage of training to ensure that the target function is monotonically converged and does not diverge;

Respectively optimizing a strategy network and a value function network objective function through the optimizer, and updating network parameters, wherein the updating process is represented as:

θ′＝θ+Δα

an Adam adaptive gradient algorithm for improving the adaptive learning rate is adopted to optimize the objective function, the optimal strategy of the hedge trimming robot training model is obtained through repeated updating iteration, the optimal action can be predicted and output through inputting the latest state data, and the control instructions of the mobile chassis and the trimming mechanical arm are output.

The invention has the advantages of

(1) The invention provides a hedge trimming robot intelligent cooperative control method based on deep reinforcement learning, which trains a deep neural network model by adopting an improved near-end strategy optimization (PPO) algorithm, avoids the complexity of manually adjusting parameters of the traditional PPO algorithm, improves the performance and generalization capability of the algorithm while ensuring the convergence speed of the algorithm, optimizes a deep neural network target function by adopting an improved Adam adaptive gradient algorithm, adaptively adjusts the learning rate, enables the algorithm to jump out of a local optimal solution, accelerates the convergence speed of the algorithm and improves the generalization capability.

(2) Because the highway operation environment is dangerous, the invention identifies the lane line by the camera arranged at the front side of the mobile chassis, respectively detects the coordinate values of the straight lines at the left side and the right side of the lane line under the camera coordinate system, limits the lane line in a certain coordinate range, realizes the lane line tracking, enables the hedge trimming robot to work in a safe area, meets the independent motion path of the mobile chassis, and simultaneously coordinates the motion of the trimming mechanical arm, thereby ensuring higher operability and tail end tracking precision.

(3) Compared with the existing control method, the intelligent cooperative control method for the hedge trimming robot based on the deep reinforcement learning omits the complex process of manual modeling calculation of the traditional method, avoids control errors caused by model inaccuracy, does not consider the constraint limitation of a mobile platform, has higher universality, can realize the automatic operation of the hedge trimming robot, and improves the automation level and the intelligent level.

Drawings

Fig. 1 illustrates a flow of steps of an intelligent cooperative control method for a hedge trimming robot according to the present invention;

FIG. 2 illustrates a schematic diagram of a Markov Decision (MDP) deep reinforcement learning model of a hedge trimming robot in accordance with the present invention;

fig. 3 shows a hedge trimming robot configuration and hedge height H and distance L detection scheme according to the present invention;

FIG. 4 shows a schematic diagram of the detection principle of the center of the cross section of the hedgerow;

FIG. 5 is a schematic view showing a principle of lane line recognition detection;

in the figure, 1-end effector pixel coordinate system, 2-feature template of hedge cross section shape, 3-center position of hedge cross section, 4-hedge cross section, 5-lane line, 6-moving chassis, 7-front lane line detection camera, 8-trimming mechanical arm Base, 9-hedge height and distance detection camera, 10-hedge cross section detection camera, 11-saw blade of manipulator end effector, 12-hedge, 13-highway guardrail, 14-front lane line detection pixel coordinate system, 15-ROI area, 16-left side target straight line front left coordinate (x)_flf，y_flf) 17-left target straight line left rear coordinate (x)_flb，y_flb) 18-Right Forward coordinate of Right target straight line (x)_frf，y_frf) 19-Right rear coordinate (x) of Right target straight line_frb，y_frb) 20-lane line.

Detailed Description

Specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings, but it should be understood that the scope of the present invention is not limited to the specific embodiments.

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention. Throughout the specification and claims, unless explicitly stated otherwise, the term "comprise" or variations such as "comprises" or "comprising", etc., will be understood to imply the inclusion of a stated element or component but not the exclusion of any other element or component.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the invention. It will be understood by those skilled in the art that the present invention may be practiced without some of these specific details. In some instances, methods, instrumentalities, and elements well known to those skilled in the art have not been described in detail so as not to obscure the present invention.

Embodiment is an intelligent cooperative control method for hedge trimming robot based on deep reinforcement learning

The hedge trimming robot performs hedge trimming operation on the expressway, the lane line width is 120mm, and the hedge height and distance are known;

as shown in fig. 3, the hedge trimming robot comprises a mobile chassis 6 and a 4-degree-of-freedom trimming mechanical arm fixed on the mobile chassis, wherein a saw blade 11 is arranged on an end effector (a cutter) of the trimming mechanical arm and is used for trimming hedges; install visual detection module on the hedge trimming robot, include: a hedge cross section detection camera 10 installed at the tail end of the trimming mechanical arm, a hedge height and distance detection camera 9 installed at a base 8 of the trimming mechanical arm, and a front lane line detection camera 7 installed at the front side of the moving chassis 6; the vision sensor chip is arranged in each camera, and an algorithm is embedded in the vision sensor chip, so that the functions of RGB image acquisition, image processing and recognition and positioning can be realized.

As shown in fig. 1, the intelligent cooperative control method for the hedge trimming robot based on the deep reinforcement learning includes the following steps:

establishing a Markov Decision (MDP) deep reinforcement learning model of a hedgerow trimming robot;

as shown in fig. 2, the Markov Decision (MDP) process of the hedge trimming robot is described by a quintuple (S, a, P, R, γ), where S represents a state set, a represents an action set, P represents a state transition probability, and takes a value from 0 to 1, R is a reward function, γ is a reward discount factor, and takes a value from 0 to 1, and is used to calculate the accumulated reward obtained by the interaction process between the agent and the environment; the intelligent agent is a vehicle-arm cooperative control module of the hedge trimming robot, and the environment comprises the hedge trimming robot, a hedge and a lane line; the strategy model of the hedge trimming robot receives the state S of the environment at the current moment_tSelecting and performing action A_tThen with a probability P (S) according to the environment model_t+1|S_t，A_t) Enter a new state S_t+1And earn a reward R_t+1Policy modelType re-acceptance state S_t+1And R_t+1Continuously generating and executing a control instruction of the hedge trimming robot, continuously optimizing and adjusting the strategy model according to the maximum reward obtained in the process until a certain condition is met, and finishing interaction between the intelligent agent and the environment;

Wherein the state set comprises the following eight parts: position P of center of hedgerow cross section under pixel coordinate system of hedgerow cross section detection camera_C＝(X_c，Y_c) The height H of the hedge, the distance L from the center of the longitudinal section of the hedge to the origin of coordinates of the trimming mechanical arm, the position of the lane line under the front lane line detection camera coordinate system, and the position of the hedge under the vehicle coordinate system

Pose of each joint of trimming mechanical arm

Position of hedgerow under coordinate system of end effector of mechanical arm of the trimming

Pose of mobile chassis under world coordinate system

In this embodiment, the state set is represented as:

state space dimension | S | ═ 25;

as shown in FIG. 4, the position P of the center of the hedge cross section in the pixel coordinate system of the hedge cross section inspection camera_C＝(X_c， Y_c) The acquisition method comprises the following steps: the hedgerow image is shot through the hedgerow cross section detection camera and feature extraction is carried out, a feature template of the hedgerow cross section shape is obtained, when the tail end of the trimming mechanical arm moves right above the hedgerow, the hedgerow cross section shape is detected and matched with the feature template, after the matching coincidence degree is larger than 80%, the hedgerow cross section feature recognition can be considered to be effective, and a cross section center coordinate value P is output_C＝(X_c，Y_c)；

As shown in fig. 3, the method for acquiring the hedge height H and the distance L from the center of the longitudinal tangent plane of the hedge to the coordinate origin of the trimming mechanical arm comprises the following steps: the hedgerow height and distance detection camera is used for shooting hedgerow images, and the hedgerow height H and the distance L from the center of the longitudinal section of the hedgerow to the coordinate origin of the trimming mechanical arm are obtained;

As shown in fig. 5, the method for acquiring the position of the lane line in the front lane line detection camera coordinate system includes: the front lane line detection camera takes a picture of a front lane, and obtains a left front (x) line and a right front (x) line of a lane line in a front ROI area from the picture_flf， y_flf) Left posterior (x)_flb，y_flb) Right front (x)_frf，y_frf) Right rear (x)_frb，y_frb) The coordinate values of (a);

Longitudinal speed of moving chassis, front wheel slip angle

In this embodiment, the action set is represented as:

dimension | a | ═ 6 of the motion space;

wherein, the reward function comprises the following parts: the operation smoothness of each joint of the trimming mechanical arm is rewarded, the distance between a hedge and an end actuator of the trimming mechanical arm is rewarded, and the tracking precision of the movable chassis to the lane line is rewarded;

wherein i is 0,1,2, 3; d omega_ikThe angular velocity differential of each joint servo motor of the trimming mechanical arm output by the system represents the running stability of the motor, a₁，b₁Is constant and a₁<0，b₁>0；a₁，b₁Respectively a quadratic term coefficient and a first order term coefficient of the smoothness reward function of the mechanical arm joint of the pruning machine, wherein the function is a unary quadratic function, and the quadratic term coefficient a ₁< 0, coefficient of first order term b₁Is more than 0, and both can take any real number;

in this embodiment, the smoothness reward function for each joint of the trimming mechanical arm is expressed as:

wherein the distance reward between the hedgerow and the end effector of the trimming robot is expressed as:

r_dist＝a₂(d_dist)²+c

in the formula a₂C is a constant and a₂<0，c>0，a₂Is the coefficient of the quadratic term of the unary quadratic function and is less than 0, c is a constant term, and the two can take any real number, r_distIs the distance between the hedge and the coordinate origin of the end effector of the trimming mechanical arm,

when the end executor of the trimming mechanical arm is closer to the hedgerow, the intelligent agent obtains a larger reward value;

in this embodiment, the distance reward function of the hedge from the trimming robot end effector is expressed as: r is_dist＝- (d_dist)²+100；

in the formula, a₃，b₂Is constant and a₃<0，b₂>0，a₃Is the coefficient of quadratic term of the unitary quadratic function and is less than 0, b₂Is a first order coefficient and is greater than 0, and the two are arbitrary real numbers, K₁，K₂The number of the intelligent agent is constant, the boundary of the coordinate range of the lane center line is represented, x represents the value of the lane center line, and the reward value of the intelligent agent is larger when the value of x is closer to the center line and is smaller otherwise;

in this embodiment, the tracking accuracy reward function of the mobile chassis to the lane line is expressed as:

In summary, the reward function is represented as:

R＝ω₁*r_arms+ω₂*r_dist+ω₃*r_track

Step two, building a deep neural network framework;

when the hedge trimming robot cooperative control training is carried out by using the deep reinforcement learning, two deep neural network models, namely a strategy network and a value function network, are included, and in a training task, because the dimensionality of the network models on state input and action output is low, the data processing requirements of the training task can be met by adopting a fully-connected neural network (MLP) model with a simple structure, so that deep neural network frames of the strategy network and the value function network can be respectively built by adopting the fully-connected neural network models;

in this embodiment, the policy network adopts a structure setting of 2 hidden layers and 1 output layer, because the state space dimension of the input layer is 25, the action space dimension of the output layer is 6, the requirement on network information capacity is small, the number of neurons of the hidden layers can be set to 128, and in the network structure, because the output action is the motor expected angular velocity, the longitudinal velocity of the mobile chassis and the front wheel rotation angle, and has a positive and negative score, the output layer height adopts a TanH function to normalize the output action, and the hidden layers adopt a ReLU as an activation function to improve the network training efficiency; the value function network adopts the structural setting of 3 hidden layers and 1 output layer, in the value function network, the input is the state and the action, the output is the Q value function, the input action is output to the network in the second full-connection hidden layer, because the dimension of the input and the output is larger, more neurons need to be set to increase the data processing capability, so in the network structure, the number of the neurons in the 1 st hidden layer is set to be 128, the number of the neurons in the 2 nd and the 3 rd hidden layers is set to be 256, and the ReLU is adopted as the activation function to improve the network training efficiency.

in a policy network, to seek a policy π_θ(A_t|S_t) After the training is finished, the optimal parameter θ is usually trained by a maximum strategic network objective reward function, and the PPO-clip with the clip item can be used as a strategic network optimization objective function, that is:

set of derived trajectories, N_k＝{τ_iDenoted by τ is a set of state-action sequences S₀，A₀，R₀…S_t，A_t，R_t}，

The ratio between the old strategy and the new strategy is expressed, epsilon is a hyperparametric truncation factor,

to use the generalized dominance function of the dominance estimation (GAE), is defined as

In the formula V (S)_t) In order to estimate the state cost function, lambda is an introduced hyper-parameter and takes the value of 0 to 1;

in a strategy network objective function, a hyper-parameter truncation factor epsilon is manually designed during training, the value size of the hyper-parameter truncation factor epsilon limits the size of an objective function confidence interval, the difference of a new strategy and an old strategy after PPO-clip requirement updating cannot be too large, when the value is too small, the performance of a strategy optimization process can be stably improved, but the algorithm training time is increased, when the value is too large, the updating of a strategy network can be unstable, when the strategy network objective function is used for training a hedge trimming robot model, the value size both ensures that the strategy network can be rapidly converged, and simultaneously meets the algorithm stability requirement, therefore, the hyper-parameter truncation factor can be designed as follows:

Wherein, is E₀The initial truncation factor is N, the training frequency is already finished, the set total training frequency is represented by N, and 1 is used for preventing the situation that the denominator is 0 is meaningless; enabling the hyperparameter truncation factor to be in an adaptive nonlinear way along with the training processAttenuation, namely, the strategy network can be ensured to accelerate the updating amplitude of the algorithm in the early training stage and ensure the algorithm to be stably converged in the later training stage;

in summary, the policy network objective function for improving the PPO algorithm is expressed as:

in an embodiment, an initial truncation factor e is set₀0.3, the total frequency of training is 10⁶；

In a value function network, the value function network is often trained with a minimum mean square error, and therefore, the value function network objective function of the improved PPO algorithm is expressed as:

in the formula, V_φ(S_t) Is a state cost function.

And step four, training the deep neural network by adopting an improved PPO algorithm according to the mean square error principle of the network objective function of the maximization strategy network objective reward function and the minimization value function network objective function.

Step five, optimizing a target function by adopting an Adam adaptive gradient algorithm for improving the adaptive learning rate, obtaining an optimal strategy of the training model of the hedgerow trimming robot through repeated updating iteration, predicting and outputting optimal actions by inputting latest state data, and outputting control instructions of the mobile chassis and the trimming mechanical arm; wherein, in the optimizer of the Adam adaptive gradient algorithm, the learning rate is expressed as:

Wherein α is the initial step size, β₁，β₂An exponential decay rate is estimated for a moment, e 0, 1, which is a small constant with a stable value, usually 10^-8，m_t，v_tRespectively representing the first order moment estimate with deviationEstimating the second moment of the deviation, which is calculated by the gradient of the objective function; in this embodiment, the initial step size α is 0.001, and the moment is estimated as the exponential decay rate β₁＝0.9，β₂0.999; down _ bdy _ a and Up _ bdy _ a represent the lower and upper bounds of the learning rate, respectively, and are represented as:

in the formula, N is the current training cycle number, and N is the preset training total cycle number; in this embodiment, the preset total training cycle number N is 10⁶；

In the process of optimizing the objective function, the lower bound of the learning rate is set as a learning rate value when the objective function of the strategy network starts to rise or a learning rate value when the value function network starts to fall; the upper bound of the learning rate is set as a learning rate value when the objective function of the strategy network starts to fall or a learning rate value when the value function network starts to rise; by adaptively truncating the learning rate in a specified interval, keeping the upper bound of the learning rate unchanged in the early stage of training, continuously improving the lower bound of the learning rate, keeping the lower bound of the learning rate unchanged in the later stage of training, and continuously reducing the lower bound of the learning rate, a relatively large learning rate can be obtained in the early stage of training to enable the target function to jump out of a local optimal solution, and the learning rate is monotonically reduced in the later stage of training to ensure that the target function is monotonically converged and not diverged;

In this embodiment, the number of cycles of the rising and falling learning rates is equal to

I.e. 0.4 x 10⁶Second, the convergence period is

I.e. 0.2 x 10⁶；

Respectively optimizing a policy network and a value function network objective function through the optimizer, and updating network parameters, wherein the updating process can be expressed as:

θ′＝θ+Δα

Through the intelligent cooperative control method for the hedge trimming robot based on the deep reinforcement learning, the tedious process of manual modeling calculation in the traditional method is omitted, the control error caused by inaccurate model is avoided, the constraint limitation of a mobile platform is not considered, the method has higher universality, the automatic operation of the hedge trimming robot can be realized, and the automation level and the intelligent level are improved. Meanwhile, an improved near-end strategy optimization (PPO) algorithm is adopted to train a deep neural network model, the complexity of manually adjusting parameters of the traditional PPO algorithm is avoided, the convergence rate of the algorithm is guaranteed, meanwhile, the performance and the generalization capability of the algorithm are improved, meanwhile, an improved Adam adaptive gradient algorithm is adopted to optimize a deep neural network target function, the learning rate is adaptively adjusted, the algorithm jumps out of a local optimal solution, the convergence rate of the algorithm is accelerated, and the generalization capability is improved.

Claims

1. An intelligent cooperative control method of a hedgerow trimming robot based on deep reinforcement learning is characterized by comprising the following steps:

the hedge trimming robot comprises a movable chassis and a trimming mechanical arm fixed on the movable chassis, and a vision detection module is mounted on the hedge trimming robot; the vision detection module comprises a hedge cross section detection camera arranged at the tail end of the trimming mechanical arm, a hedge height and distance detection camera arranged at the base of the trimming mechanical arm, and a front side lane line detection camera arranged on the moving chassis;

the intelligent cooperative control method for the hedge trimming robot comprises the following steps:

step one, establishing a Markov decision MDP deep reinforcement learning model of a hedgerow trimming robot;

step two, building a deep neural network framework;

step four, training a deep neural network by adopting an improved PPO algorithm according to a network target reward function of a maximization strategy and a network target function mean square error principle of a minimization function;

2. The intelligent cooperative control method for the hedge trimming robot according to claim 1, characterized in that:

in the MDP deep reinforcement learning model of the hedge trimming robot established in the first step, the Markov decision MDP process is described by a quintuple (S, A, P, R, gamma), wherein S represents a state set, A represents an action set, P represents a state transition probability, the value is 0 to 1, R is an incentive function, gamma is an incentive discount factor, the value is 0 to 1, and the method is used for calculating the accumulated incentive obtained in the interaction process of the agent and the environment; the intelligent agent is a vehicle-arm cooperative control module of the hedge trimming robot, and the environment comprises the hedge trimming robot, a hedge and a lane line; the strategy model of the hedge trimming robot receives the state S of the environment at the current moment_tSelecting and performing action A_tThen with probability P (S) according to the environment model_t+1|S_t，A_t) Enter a new state S_t+1And earn a reward R_t+1Policy model re-acceptance State S_t+1And R_t+1And continuously generating and executing a control instruction of the hedge trimming robot, continuously optimizing and adjusting the strategy model according to the maximum reward obtained in the process until a certain condition is met, and finishing the interaction between the intelligent agent and the environment.

3. The intelligent cooperative control method for the hedge trimming robot according to claim 2, characterized in that:

The state set includes the following eight parts: position P of center of hedgerow cross section under pixel coordinate system of hedgerow cross section detection camera_C＝(X_c，Y_c) The height H of the hedgerow, the distance L from the center of the longitudinal section of the hedgerow to the coordinate origin of the trimming mechanical arm, the position of the lane line under the coordinate system of the front lane line detection camera, and the position of the hedgerow under the coordinate system of the vehicle

Pose of each joint of trimming mechanical arm

Pose of mobile chassis under world coordinate system

Wherein, the position P of the center of the cross section of the hedge under the pixel coordinate system of the detection camera of the cross section of the hedge_C＝(X_c，Y_c) The acquisition method comprises the following steps: through the hedgerowThe method comprises the steps that a hedgerow image is shot by a cross section detection camera and feature extraction is carried out, a feature template of the cross section shape of the hedgerow is obtained, when the tail end of a trimming mechanical arm moves right above the hedgerow, the cross section shape of the hedgerow is detected and matched with the feature template, after the matching coincidence degree is larger than 80%, the cross section feature identification of the hedgerow can be considered to be effective, and a cross section center coordinate value P is output_C＝(X_c，Y_c)；

The method for acquiring the height H of the hedgerow and the distance L from the center of the longitudinal tangent plane of the hedgerow to the coordinate origin of the trimming mechanical arm comprises the following steps: the hedgerow height and distance detection camera is used for shooting hedgerow images, and the hedgerow height H and the distance L from the center of the longitudinal section of the hedgerow to the coordinate origin of the trimming mechanical arm are obtained;

The method for acquiring the position of the lane line under the coordinate system of the front lane line detection camera comprises the following steps: the front lane line detection camera takes a picture of a front lane, and obtains a left front (x) line and a right front (x) line of a lane line in a front ROI area from the picture_flf，y_flf) Left posterior (x)_flb，y_flb) Right front (x)_frf，y_frf) Right rear (x)_frb，y_frb) The coordinate values of (2).

Wherein the set of actions includes the following three parts: motion velocity component of each joint of trimming mechanical arm

Longitudinal speed of moving chassis, front wheel slip angle

wherein i is 0, 1, 2, 3; d omega_ikThe angular velocity differential of the servo motor of each joint of the trimming mechanical arm, which is output by the system, represents the running stability of the motor, a₁，b₁Is constant and a₁＜0，b₁＞0；

r_dist＝a₂(d_dist)²+c

in the formula a₂C is a constant and a₂＜0，c＞0，r_distIs the distance of the hedge from the origin of coordinates of the end effector of the trimming robot arm,

when the end effector of the trimming mechanical arm is closer to the hedgerow, the intelligent agent obtains a larger reward value;

Wherein, the reward of the tracking precision of the mobile chassis to the lane line is expressed as:

in the formula, a₃，b₂Is constant and a₃＜0，b₂＞0，K₁，K₂The number of the intelligent agent is a constant, the boundary of the coordinate range of the center line of the lane is represented, x represents the value of the center line of the lane line, and when the value of x is closer to the center line, the reward value of the intelligent agent is larger, otherwise, the reward value is smaller;

in summary, the reward function is represented as:

R＝ω₁*r_arms+ω₂*r_dist+ω₃*r_track

in the formula, omega₁，ω₂，ω₃Respectively a trimming robot systemThe weights of all parts of the reward function are set to be 0-1, and the larger the weight is, the better the performance of the trained prediction model in the aspect is.

4. The intelligent cooperative control method for the hedge trimming robot according to claim 1, characterized in that: and step two, adopting a fully-connected neural network model to respectively build a deep neural network framework of the strategy network and a deep neural network framework of the value function network.

5. The intelligent cooperative control method for the hedge trimming robot according to claim 1, characterized in that:

in step three, the policy network objective function of the improved PPO algorithm is expressed as:

in which M is

π_θAnd pi'_θRespectively representing the original strategy and the updated strategy in a single round; n is a radical of_kFor a set of trajectories obtained after executing a policy in an environment, N_k＝{τ_iR, denotes a set of state-action sequences S ₀，A₀，R₀…S_t，A_t，R_t}，

to use the merit function of generalized merit estimation (GAE), defined as

In the formulaV(S_t) In order to estimate the state cost function, lambda is an introduced hyper-parameter and takes the value of 0 to 1; e is a₀The initial truncation factor is N, the training frequency is already finished, and the set total training frequency is represented by N;

the value function network objective function of the improved PPO algorithm is expressed as:

in the formula, V_φ(S_t) Is a state cost function.

6. The intelligent cooperative control method for the hedge trimming robot according to claim 1, characterized in that:

and fifthly, the learning rate in the optimizer of the Adam adaptive gradient algorithm is represented as:

in the formula, N is the current training cycle number, N is the preset total training cycle number, and in the process of optimizing the objective function, the lower bound of the learning rate is set as the learning rate value when the objective function of the strategy network starts to rise or the learning rate value when the value function network starts to fall; the upper bound of the learning rate is set as a learning rate value when the objective function of the strategy network starts to fall or a learning rate value when the value function network starts to rise; by adaptively truncating the learning rate in a specified interval, keeping the upper bound of the learning rate unchanged in the early stage of training, continuously improving the lower bound of the learning rate, keeping the lower bound of the learning rate unchanged in the later stage of training, and continuously reducing the lower bound of the learning rate, a relatively large learning rate can be obtained in the early stage of training to enable the target function to jump out of a local optimal solution, and the learning rate is monotonically reduced in the later stage of training to ensure that the target function is monotonically converged and not diverged;

θ′＝θ+Δα