CN114771783A - Control method and system for submarine stratum space robot - Google Patents


Info

Publication number
CN114771783A
CN114771783A (application CN202210623726.5A)
Authority
CN
China
Prior art keywords: robot, current, critic, actor, network
Prior art date
Legal status
Granted
Application number
CN202210623726.5A
Other languages
Chinese (zh)
Other versions
CN114771783B (en)
Inventor
陈家旺
林型双
张培豪
翁子欣
郭进
王荧
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210623726.5A priority Critical patent/CN114771783B/en
Publication of CN114771783A publication Critical patent/CN114771783A/en
Application granted granted Critical
Publication of CN114771783B publication Critical patent/CN114771783B/en
Legal status: Active

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B63SHIPS OR OTHER WATERBORNE VESSELS; RELATED EQUIPMENT
    • B63CLAUNCHING, HAULING-OUT, OR DRY-DOCKING OF VESSELS; LIFE-SAVING IN WATER; EQUIPMENT FOR DWELLING OR WORKING UNDER WATER; MEANS FOR SALVAGING OR SEARCHING FOR UNDERWATER OBJECTS
    • B63C11/00Equipment for dwelling or working underwater; Means for searching for underwater objects
    • B63C11/52Tools specially adapted for working underwater, not otherwise provided for
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Ocean & Marine Engineering (AREA)
  • Manipulator (AREA)

Abstract

The invention relates to a control method and a control system for a submarine stratum space robot, in the technical field of submarine stratum space robots. Before the robot starts to move, a preset operation point is set; during motion, the action for the next moment is obtained from the preset operation point, the state at the current moment, and the control model at the current moment. The state comprises the attitude information, positioning information, and resistance of the submarine drilling robot during motion, together with the working state of each driving hydraulic cylinder inside it. The action, which controls the motion of the submarine stratum space robot, comprises the control command for each driving hydraulic cylinder. The control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm. The invention can realize automatic control of the submarine stratum space robot.

Description

Control method and system for submarine stratum space robot
Technical Field
The invention relates to the technical field of submarine stratum space robots, in particular to a control method and a control system of a submarine stratum space robot.
Background
As humanity continues to develop and exploit the oceans, demand for operation tasks in the submarine stratum space, such as resource exploration and environmental monitoring, is growing. For these tasks, a new type of submarine stratum space robot is a candidate solution. After being deployed into the seabed stratum through a base station, the robot can drill and move freely within the stratum and, by carrying various sensor arrays, achieve preset operation targets.
When executing a task, the submarine stratum space robot usually needs to move to a preset operation point after receiving a control instruction in order to complete the operation. However, control in the submarine stratum field is at present essentially manual, so devising a suitable control method and device is of great importance for realizing automatic control of the submarine stratum space robot.
Disclosure of Invention
The invention aims to provide a control method and a control system for a submarine stratum space robot, which can realize automatic control of the submarine stratum space robot.
In order to achieve the purpose, the invention provides the following scheme:
a method of controlling a subsea stratigraphic space robot, comprising:
step 1, presetting a preset operation point;
step 2, acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot;
step 3, obtaining the action at the next moment according to the preset operation point, the state at the current moment and the control model at the current moment, wherein the action at the next moment is used for controlling the motion of the submarine stratum space robot; the actions comprise control instructions of each driving hydraulic cylinder in the submarine drilling robot; the control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm;
and 4, repeatedly executing the step 2 to the step 3 until the seabed stratum space robot moves to the preset operation point.
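As a hedged illustration, the step 1 to step 4 loop above can be sketched as follows; `get_state`, `control_model`, `apply_action`, and `reached` are hypothetical stand-ins for the robot's sensor interface, the trained Actor network, the hydraulic-cylinder drivers, and the arrival test, none of which are specified by the patent at this level of detail.

```python
def control_loop(goal, get_state, control_model, apply_action,
                 reached, max_steps=1000):
    """Drive the robot toward the preset operation point `goal`.

    Step 2: read the current state; step 3: query the control model
    for the next action; step 4: repeat until the goal is reached.
    """
    for _ in range(max_steps):
        state = get_state()                  # attitude, position, resistance, cylinder states
        if reached(state, goal):
            return True                      # arrived at the preset operation point
        action = control_model(state, goal)  # control command for each hydraulic cylinder
        apply_action(action)
    return False
```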
Optionally, the specific training process of the initial control model is as follows:
constructing a simulation environment model; the simulation environment model comprises a submarine stratum environment simulation model and a robot simulation motion model;
setting a training preset operation point;
initializing a Critic network, an Actor network, a Critic target network and an Actor target network;
under the current iteration times, determining the current moment state of the robot simulation motion model according to the simulation environment model;
obtaining the action of the robot simulation motion model at the current moment according to the training preset operation point, the state of the robot simulation motion model at the current moment and an Actor network under the current iteration times;
the robot simulation motion model executes the action of the robot simulation motion model at the current moment to obtain behavior reward and the state of the next moment;
inputting the state of the robot simulation motion model at the next moment into an Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment;
inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into a Critic network under the current iteration number to obtain an estimated Q value;
obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward;
updating the Critic network under the current iteration times and the Actor network under the current iteration times according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration times and the Actor network under the next iteration times;
and updating the Critic target network and the Actor target network under the current iteration count according to the Critic network and the Actor network under the next iteration count, to obtain the Critic target network and the Actor target network under the next iteration count; the iteration count is then updated and the next iteration begins. The current training episode ends when the robot simulation motion model collides or reaches the preset training operation point, and iteration stops once the set iteration count is reached, yielding the initial control model.
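The training procedure above follows the standard DDPG pattern. The following is a minimal one-dimensional sketch of that pattern, not the patent's actual implementation: the toy environment, the linear "networks", and all hyperparameters (`GAMMA`, `TAU`, `LR`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
GAMMA, TAU, LR = 0.99, 0.05, 1e-3

w_a = rng.normal(size=1)                # Actor:  a = w_a * s
w_c = rng.normal(size=2)                # Critic: Q(s, a) = w_c . [s, a]
w_a_t, w_c_t = w_a.copy(), w_c.copy()   # target networks start as copies

def env_step(s, a):
    """Toy stand-in for the simulation environment model."""
    s_next = float(np.clip(s + 0.1 * a, -2.0, 2.0))
    return s_next, -abs(s_next)         # behavior reward: stay near the origin

for episode in range(100):
    s = float(rng.uniform(-1, 1))
    for _ in range(10):
        a = float(np.clip(w_a[0] * s + rng.normal(scale=0.1), -1, 1))
        s_next, r = env_step(s, a)
        a_next = float(w_a_t[0] * s_next)                  # Actor target network
        q_actual = r + GAMMA * float(w_c_t @ [s_next, a_next])  # "actual" Q value
        q_est = float(w_c @ [s, a])                        # estimated Q value
        # Critic update: gradient descent on (q_est - q_actual)^2
        w_c -= LR * 2.0 * (q_est - q_actual) * np.array([s, a])
        # Actor update: ascend dQ/da * da/dw_a = w_c[1] * s
        w_a += LR * w_c[1] * np.array([s])
        # Soft update of both target networks
        w_a_t += TAU * (w_a - w_a_t)
        w_c_t += TAU * (w_c - w_c_t)
        s = s_next
```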
Optionally, the updating of the Critic network and the Actor network under the current iteration count according to the actual Q value and the estimated Q value, to obtain the Critic network and the Actor network under the next iteration count, specifically includes:
obtaining a Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value;
training the Critic network under the current iteration number according to the Critic network loss function under the current iteration number to obtain the Critic network under the next iteration number;
obtaining an Actor network loss function under the current iteration times according to the estimated Q value;
and updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
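A hedged sketch of the two loss functions described above, using the conventional DDPG forms (mean squared error for the Critic, negated mean Q for the Actor); the patent does not spell out the exact expressions, so these forms are assumptions.

```python
def critic_loss(q_estimated, q_actual):
    """Critic loss: mean squared error between estimated and actual Q values."""
    n = len(q_estimated)
    return sum((qe - qa) ** 2 for qe, qa in zip(q_estimated, q_actual)) / n

def actor_loss(q_estimated):
    """Actor loss: the Actor maximizes Q, i.e. minimizes -mean(Q)."""
    return -sum(q_estimated) / len(q_estimated)
```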
Optionally, the updating of the Critic target network and the Actor target network under the current iteration count according to the Critic network and the Actor network under the next iteration count, to obtain the Critic target network and the Actor target network under the next iteration, specifically includes:
updating the Critic target network under the current iteration number according to the Critic network under the next iteration number to be used as the Critic target network under the next iteration;
and updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to serve as the Actor target network under the next iteration.
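The target-network updates above can be sketched as a soft (Polyak) update, which is the conventional DDPG choice; the patent does not state the exact update rule, so the rule and the coefficient `tau` here are assumptions.

```python
def soft_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward its online counterpart."""
    return [t + tau * (o - t) for t, o in zip(target_params, online_params)]
```

With `tau = 1` this degenerates to a hard copy of the online network into the target network; small values of `tau` keep the target networks slowly varying, which stabilizes the Q-value targets during training.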
A control system for a subsea stratigraphic space robot, comprising:
the first preset module is used for presetting a preset operation point;
the acquisition module is used for acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot;
the control module is used for obtaining the action at the next moment according to the preset operation point, the state at the current moment and the control model at the current moment, and the action at the next moment is used for controlling the motion of the submarine stratum space robot; the actions include control commands for each drive hydraulic cylinder inside the subsea drilling robot; the control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm;
and the execution module is used for repeatedly executing the acquisition module and the control module until the seabed stratum space robot moves to the preset operation point.
Optionally, the control system of the submarine stratum space robot further includes:
the building module is used for building a simulation environment model; the simulation environment model comprises a seabed stratum environment simulation model and a robot simulation motion model;
the second preset module is used for setting a preset training operation point;
the network initialization module is used for initializing a Critic network, an Actor network, a Critic target network and an Actor target network;
the current state determining module is used for determining the current state of the robot simulation motion model according to the simulation environment model under the current iteration times;
the current action determining module is used for obtaining the action of the robot simulation motion model at the current moment according to the preset training operation point, the state of the robot simulation motion model at the current moment and an Actor network under the current iteration times;
the next moment state determining module is used for the robot simulation motion model to execute the action of the robot simulation motion model at the current moment to obtain behavior rewards and the state of the next moment;
the next moment action determining module is used for inputting the state of the robot simulation motion model at the next moment into an Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment;
the estimated Q value calculation module is used for inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into a Critic network under the current iteration number to obtain an estimated Q value;
the actual Q value calculation module is used for obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward;
the network updating module is used for updating the Critic network under the current iteration times and the Actor network under the current iteration times according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration times and the Actor network under the next iteration times;
and the target network updating module is used for updating the Critic target network and the Actor target network under the current iteration count according to the Critic network and the Actor network under the next iteration count, to obtain the Critic target network and the Actor target network under the next iteration; the iteration count is then updated and the next iteration begins. The current training episode ends when the robot simulation motion model collides or reaches the preset training operation point, and iteration stops once the set iteration count is reached, yielding the initial control model.
Optionally, the network update module specifically includes:
the Critic network loss function calculation unit is used for obtaining a Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value;
the Critic network updating unit is used for training the Critic network under the current iteration times according to the Critic network loss function under the current iteration times to obtain the Critic network under the next iteration times;
the Actor network loss function calculation unit is used for obtaining an Actor network loss function under the current iteration times according to the estimated Q value;
and the Actor network updating unit is used for updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
Optionally, the target network updating module specifically includes:
the Critic target network updating unit is used for updating the Critic target network under the current iteration times according to the Critic network under the next iteration times to be used as the Critic target network under the next iteration;
and the Actor target network updating unit is used for updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to be used as the Actor target network under the next iteration times.
According to the specific embodiments provided herein, the invention discloses the following technical effects. A preset operation point is set before the robot starts to move; during motion, the action for the next moment is obtained from the preset operation point, the state at the current moment, and the control model at the current moment. The state comprises the attitude information, positioning information, and resistance of the submarine drilling robot during motion, together with the working state of each driving hydraulic cylinder inside it; the action, which controls the motion of the submarine stratum space robot, comprises the control command for each driving hydraulic cylinder. The control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm, so automatic control of the submarine stratum space robot can be realized.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without inventive effort.
FIG. 1 is a flow chart of a method for controlling a subsea stratigraphic space robot according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a hydraulic cylinder driving form change of a submarine stratum space robot provided by an embodiment of the invention;
fig. 3 is a flowchart of training a control model of a subsea stratigraphic space robot according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
In recent years, control methods based on reinforcement learning have been widely applied to robot control with good results. The invention therefore combines the motion characteristics of the submarine stratum space robot, the submarine stratum environment, and a reinforcement learning algorithm to provide a control method for the submarine stratum space robot, so as to solve the control problem of the robot operating in the submarine stratum space and realize automatic control. The control method specifically comprises the following steps:
step 1, presetting a preset operation point.
Step 2, acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot.
Step 3, obtaining the action at the next moment according to the preset operation point, the state at the current moment, and the control model at the current moment; the action at the next moment is used for controlling the motion of the submarine stratum space robot. The action comprises a control instruction for each driving hydraulic cylinder in the submarine drilling robot. The control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm. The control model also continues to be adjusted during actual use, as follows: the state at the current moment is input into the control model (the Actor network) at the current moment to obtain the action at the next moment, and the control model at the current moment is then updated with the DDPG algorithm using the Critic network, the Critic target network, and the Actor target network.
And 4, repeatedly executing the steps 2 to 3 until the seabed stratum space robot moves to the preset operation point.
In practical application, the specific training process of the initial control model (the control model at time 0 in practical use) is as follows:
constructing a simulation environment model; the simulation environment model comprises a seabed stratum environment simulation model and a robot simulation motion model.
And setting a training preset operation point.
Initializing a Critic network, an Actor network, a Critic target network and an Actor target network.
And under the current iteration times, determining the current state of the robot simulation motion model according to the seabed stratum environment simulation model and the robot simulation motion model.
And obtaining the action of the robot simulation motion model at the current moment according to the training preset operation point, the state of the robot simulation motion model at the current moment and the Actor network under the current iteration times.
And the robot simulation motion model executes the action of the robot simulation motion model at the current moment to obtain a behavior reward and the state at the next moment.
And inputting the state of the robot simulation motion model at the next moment into an Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment.
And inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into a Critic network under the current iteration number to obtain an estimated Q value.
And obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward.
And updating the Critic network under the current iteration number and the Actor network under the current iteration number according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration number and the Actor network under the next iteration number.
And updating the Critic target network and the Actor target network under the current iteration count according to the Critic network and the Actor network under the next iteration count, to obtain the Critic target network and the Actor target network under the next iteration count; the iteration count is then updated and the next iteration begins. The current training episode ends when the robot simulation motion model collides or reaches the preset training operation point, and iteration stops once the set iteration count is reached, yielding the initial control model.
In practical application, the updating the Critic network under the current iteration number and the Actor network under the current iteration number according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration number and the Actor network under the next iteration number specifically includes:
and obtaining a Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value.
And training the Critic network under the current iteration number according to the Critic network loss function under the current iteration number to obtain the Critic network under the next iteration number.
And obtaining an Actor network loss function under the current iteration number according to the estimated Q value.
And updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
In practical application, the updating of the Critic target network and the Actor target network under the current iteration count according to the Critic network and the Actor network under the next iteration count, to obtain the Critic target network and the Actor target network under the next iteration, specifically includes:
and updating the Critic target network under the current iteration times according to the Critic network under the next iteration times to serve as the Critic target network under the next iteration.
And updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to serve as the Actor target network under the next iteration.
As shown in fig. 1, the embodiment of the present invention provides a more specific control method for a submarine stratigraphic space robot based on the motion characteristics of the submarine stratigraphic space robot, the submarine stratigraphic environment and the reinforcement learning algorithm, which specifically comprises the following steps:
s101, defining the task target, state, action and reward function of the submarine stratum space robot.
S102, arranging a sensor array on the body of the submarine stratum space robot, and collecting state parameters of the robot in the motion process.
And S103, building a control model of the seabed stratum space robot based on reinforcement learning.
And S104, learning a training control model, and applying the training control model to an actual operation task of the submarine stratum space robot.
In practical applications, the task objective is: given an input preset operation point, the robot autonomously plans and moves to that point, and the path the submarine stratum space robot follows to the preset operation point is the optimal path. The optimal path avoids obstructed areas during motion while being the shortest in length. Obstructed areas are regions of the stratum that are difficult to drill through.
The states specifically refer to attitude information, positioning information, blocked condition in the drilling process of the submarine stratum space robot and the working states of all driving hydraulic cylinders in the robot in the moving process of the robot.
Based on the definition of the attitude information, the positioning information and the obstructed information in the drilling process of the submarine stratum space robot, the state of the submarine stratum space robot can be defined as follows:
s = [x, y, z, ψ, θ, φ, f1, …, fn, h1, …, hm]
The attitude information of the submarine stratum space robot is expressed with spatial rotation Euler angles; specifically, ψ, θ, and φ are defined as the rotation angles about the Z, Y, and X axes, i.e., the yaw, pitch, and roll angles respectively.
The positioning information of the submarine stratum space robot is expressed by adopting a space coordinate value, specifically, a northeast coordinate system is established by taking a release point of the submarine stratum space robot as a coordinate origin, and X, Y and Z are defined as coordinate values of the submarine stratum space robot on an X axis, a Y axis and a Z axis, namely the positioning information.
The obstruction information of the submarine stratum space robot during drilling is specifically the robot resistance value measured at specific points on the robot body, defined as fi (1 ≤ i ≤ n), where n is the number of specific points defined on the body.
The working state of each driving hydraulic cylinder in the submarine stratum space robot is defined as hi (1 ≤ i ≤ m), where m is the number of driving hydraulic cylinders in the robot. The working state of each driving hydraulic cylinder specifically comprises its open-close state, oil pressure state, and flow state, defined as oi, pi, and ui respectively. That is:
hi = [oi, pi, ui]
The action of the submarine stratum space robot is its control instruction. The robot as a whole is driven by a combination of multiple hydraulic cylinders, and coordinated control of these cylinders translates into a specific motion form, so the drilling speed and direction of the robot can be controlled. Control of a hydraulic cylinder comprises control of its open-close state, its oil pressure state, and its flow state. Specifically, the action may be represented as:
a = [h1, …, hm]
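As an illustration, the state vector s and action vector a defined above can be assembled as follows; the dimensions n (resistance points) and m (hydraulic cylinders), and all numeric values, are made-up examples rather than the patent's actual configuration.

```python
def make_state(x, y, z, yaw, pitch, roll, resistances, cylinders):
    """s = [x, y, z, psi, theta, phi, f1..fn, h1..hm flattened]."""
    s = [x, y, z, yaw, pitch, roll]
    s.extend(resistances)                   # f_i, 1 <= i <= n
    for o_i, p_i, u_i in cylinders:         # h_i = [o_i, p_i, u_i]
        s.extend([o_i, p_i, u_i])
    return s

def make_action(cylinders):
    """a = [h1, ..., hm]: one (open/close, pressure, flow) triple per cylinder."""
    return [list(h) for h in cylinders]
```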
the subsea stratigraphic space robot is a multi-body segment structure including, but not limited to, a kinematic support body segment, a steering body segment, a propulsion body segment, and a drilling body segment; fig. 2 is a schematic diagram showing the variation of the hydraulic cylinder driving form of the submarine stratum space robot.
When the hydraulic cylinder inside a support body segment is not displaced, its form is as shown in part (a)-1 of fig. 2, and the segment supports the drilling motion. After the internal hydraulic cylinder is displaced, the form is as shown in part (a)-2 of fig. 2, which facilitates movement.
When the displacement of the internal hydraulic cylinder of a propulsion body segment increases, the form is as shown in part (b)-1 of fig. 2, and, in cooperation with the support body segments, it propels the front of the robot forward. When the displacement decreases, the form is as shown in part (b)-2 of fig. 2, and, in cooperation with the support body segments, it pulls the rear of the robot forward.
Four hydraulic cylinders are arranged inside the steering body segment. When they are not displaced, the form of the steering body segment is as shown in part (c)-1 of fig. 2; when the four cylinders are displaced by different lengths, a displacement difference forms between them and steering of the segment is achieved, as shown in part (c)-2 of fig. 2.
The reward function specifically comprises a control behavior reward, an obstacle avoidance behavior reward and a target achievement degree reward.
The control behavior reward is a reward-and-punishment evaluation of the control behavior during the motion of the submarine stratum space robot, and is defined as r1. (The defining formula of r1 and the symbol denoting the steering angle appear only as equation images in the original.) When the robot drills straight, it is given the maximum positive reward; when the robot performs steering drilling, the positive reward given decreases as the steering angle increases.
The obstacle avoidance behavior reward is a reward-and-punishment evaluation of the obstacle avoidance behavior of the submarine stratum space robot, and is defined as r2. (Its defining formula appears only as an equation image in the original.) When the robot moves away from an obstacle area, a positive reward is given; when the robot approaches an obstacle area, a negative reward is given; when the robot enters an obstacle area, a large negative reward is given.
The target achievement degree reward is a reward-and-punishment evaluation of the degree of completion of the operation task of the submarine stratum space robot, and is defined as r3. (Its defining formula appears only as an equation image in the original.) When the robot moves away from the preset operation point, a negative reward is given; when the robot approaches the preset operation point, a positive reward is given; when the robot reaches the preset operation point, a large positive reward is given.
Based on the definitions of the control behavior reward, the obstacle avoidance behavior reward and the target achievement degree reward of the submarine stratum space robot, the reward function r can be defined as the weighted sum of the three terms, where k_i denotes the reward coefficient of each term:

r = k1·r1 + k2·r2 + k3·r3
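The combination of the three reward terms can be sketched as below. The weighted sum follows the definition of r above; the cosine shape of the control reward and the equal coefficient values are illustrative assumptions, since the original formulas are not reproduced in the text.

```python
import math

def control_reward(steer_angle, r_max=1.0):
    # Assumed shape only: maximal when drilling straight (angle 0),
    # decreasing as the steering angle grows, as the text describes.
    return r_max * math.cos(steer_angle)

def total_reward(r1, r2, r3, k=(1.0, 1.0, 1.0)):
    # r = k1*r1 + k2*r2 + k3*r3, with k_i the reward coefficients
    # (values assumed equal here for illustration).
    return k[0] * r1 + k[1] * r2 + k[2] * r3
```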
In practical applications, the sensor array comprises a positioning sensor, an attitude sensor and a plurality of resistance sensors. The state of the robot during motion is s: the positioning information of the submarine stratum space robot is measured by the positioning sensor in the sensor array, the attitude information is measured by the attitude sensor, and the obstruction information during drilling is measured by the resistance sensors. By calculating the Euclidean distance between the measured position and the preset operation point, it can be judged whether the robot is moving away from, approaching, or has arrived at the preset operation point. The attitude information measured by the attitude sensor is used to judge whether the robot moves straight during motion, or to determine its specific steering angle.
By calculating the change in the accumulated resistance readings of the resistance sensors, the change in the obstruction encountered over the whole drilling process can be judged, i.e. whether the robot is moving away from or approaching an obstacle area. By defining a resistance threshold ε, whether the robot has entered an obstacle area can be judged: when f_i > ε (1 ≤ i ≤ n), the robot is judged to have entered an obstacle area; otherwise, it has not.
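The threshold test just described can be sketched directly; the function below flags entry into an obstacle area when any resistance reading f_i exceeds ε:

```python
def entered_obstacle_area(resistances, eps):
    # True when f_i > eps for any sensor i (1 <= i <= n),
    # i.e. the robot is judged to have entered an obstacle area.
    return any(f > eps for f in resistances)
```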
In practical application, the third step is specifically:
The reinforcement-learning-based control model of the seabed stratum space robot is built on the DDPG algorithm. On the basis of the DQN algorithm, DDPG adds a policy network that outputs the action values of the robot, and learns the Q network and the policy network simultaneously. The DDPG algorithm has an Actor-Critic structure.
Constructing the DDPG algorithm specifically comprises constructing the Q network, the policy network and the target networks. The Q network is the Critic network, and the policy network is the Actor network. The Actor network takes the state of the environment at the current time as input and outputs a control instruction for each drive hydraulic cylinder of the robot.
The weight of the Critic network is defined as θ^Q, i.e. the Critic network may be denoted as Q(s_t, a_t | θ^Q); the weight of the Actor network is defined as θ^μ, i.e. the Actor network may be denoted as μ(s_t | θ^μ). The initial weights of the target networks are copied from the Critic network and the Actor network, and specifically comprise two parts, defined as θ^Q′ and θ^μ′; the target networks may be denoted as Q′(s_{t+1}, a_{t+1} | θ^Q′) and μ′(s_t | θ^μ′) respectively.
The loss for training the Critic network is calculated from the actual Q value computed by the target network and the estimated Q value computed by the Critic network.
The actual Q value calculated by the target network is as follows:
y_t = r_t + γ·Q′(s_{t+1}, a_{t+1} | θ^Q′)
the Critic network loss function is calculated as follows:
Loss_Critic = (1/N)·Σ_t (y_t − Q(s_t, a_t | θ^Q))²
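The TD target y_t and the mean-squared Critic loss can be sketched numerically in plain Python (an illustration of the two formulas, not the patent's implementation; the discount γ = 0.99 default is a common choice, not specified in the text):

```python
def td_target(r_t, q_next, gamma=0.99):
    # y_t = r_t + gamma * Q'(s_{t+1}, a_{t+1} | theta^Q')
    return r_t + gamma * q_next

def critic_loss(targets, estimates):
    # Loss_Critic = (1/N) * sum_t (y_t - Q(s_t, a_t | theta^Q))^2
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, estimates)) / n
```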
In the Actor network, the Q value fed back by the Critic network is used as the loss function for model training, specifically:

Loss_Actor = −Q(s_t, a_t | θ^Q)
The Actor network is then updated by gradient descent:

∇_{θ^μ} J ≈ (1/N)·Σ_t ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t}
The Actor network takes the state of the environment at the current time as input and outputs a control instruction for each drive hydraulic cylinder of the robot according to a_t = μ(s_t | θ^μ). Specifically,

a_t = [h_{1t}, …, h_{mt}] = μ(s_t | θ^μ) = μ(x_t, y_t, z_t, φ_t, θ_t, ψ_t, f_{1t}, …, f_{nt}, h_{1t}, …, h_{mt} | θ^μ)
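As a hedged sketch of this mapping, μ can be pictured as a network that takes the full state vector and returns one bounded command per cylinder. The single linear layer and tanh squashing below are placeholders for the real deep policy network, and the stroke bound is an assumption:

```python
import math

def actor(state, weights, bias, stroke_max=0.10):
    # Placeholder for mu(s_t | theta^mu): one linear layer followed by
    # tanh squashing, keeping each cylinder command within its stroke.
    # A real Actor network is a deep MLP; weights/bias here are illustrative.
    out = []
    for row, b in zip(weights, bias):
        z = sum(w * s for w, s in zip(row, state)) + b
        out.append(stroke_max * math.tanh(z))
    return out
```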
Step four specifically: as shown in fig. 3, the training of the control model includes the following steps:
S1, establishing a simulation environment model of the seabed stratum and a simulation motion model of the robot by using a computer, and setting the release point of the robot base station and the preset operation point.
S2, initializing the Critic network Q and the Actor network μ, initializing the target networks Q′ and μ′, and setting the number of training rounds of the model.
S3, each training round includes a plurality of time steps. Each training time step specifically comprises the following steps:
(1) According to the state s_t of the environment in the current time step, the Actor network outputs the action a_t to be executed by the submarine stratum space robot.
(2) The submarine stratum space robot executes a_t; the robot obtains the feedback r_t, and the state s_t transitions to s_{t+1}.
(3) The Critic network calculates the estimated Q value, i.e. Q(s_t, a_t | θ^Q); the target network calculates the actual Q value, i.e. y_t.
(4) Loss_Actor and Loss_Critic are calculated, and the weights of the Critic network and the Actor network are updated.
(5) The target network weights are softly updated, specifically:

θ^Q′ ← τ·θ^Q + (1−τ)·θ^Q′,  θ^μ′ ← τ·θ^μ + (1−τ)·θ^μ′
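The soft update rule above can be sketched element-wise. The value τ = 0.005 is a common choice for this interpolation factor, not one specified in the patent:

```python
def soft_update(target_weights, online_weights, tau=0.005):
    # theta' <- tau * theta + (1 - tau) * theta', applied element-wise
    # to corresponding weights of the online and target networks.
    return [tau * w + (1.0 - tau) * wt
            for wt, w in zip(target_weights, online_weights)]
```

Applying this every time step moves the target networks slowly toward the online networks, which stabilizes the TD targets.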
After the learning and training of the control model of the submarine stratum space robot is completed, the model can be applied to actual operation tasks of the robot.
The embodiment of the invention further provides a control system of the submarine stratum space robot corresponding to the above method, the system comprising:
the first preset module is used for presetting a preset operation point.
The acquisition module is used for acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot.
The control module is used for obtaining the action at the next moment according to the preset operation point, the state at the current moment and the control model at the current moment, and the action at the next moment is used for controlling the motion of the submarine stratum space robot; the actions include control commands for each drive hydraulic cylinder inside the subsea drilling robot; the control model at the current moment is obtained by updating the control model at the last moment based on a DDPG algorithm.
And the execution module is used for repeatedly executing the acquisition module and the control module until the submarine stratum space robot moves to the preset operation point.
As an optional implementation manner, the control system of the subsea stratigraphic space robot further comprises:
the building module is used for building a simulation environment model; the simulation environment model comprises a submarine stratum environment simulation model and a robot simulation motion model.
And the second preset module is used for setting a preset training operation point.
And the network initialization module is used for initializing a Critic network, an Actor network, a Critic target network and an Actor target network.
And the current state determining module is used for determining the current state of the robot simulation motion model according to the simulation environment model under the current iteration times.
And the current action determining module is used for obtaining the action of the robot simulation motion model at the current moment according to the preset training operation point, the state of the robot simulation motion model at the current moment and an Actor network under the current iteration times.
And the next moment state determining module is used for the robot simulation motion model to execute the action of the robot simulation motion model at the current moment so as to obtain behavior rewards and the state of the next moment.
And the next moment action determining module is used for inputting the state of the robot simulation motion model at the next moment into the Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment.
And the estimated Q value calculation module is used for inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into the Critic network under the current iteration number to obtain an estimated Q value.
And the actual Q value calculation module is used for obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward.
And the network updating module is used for updating the Critic network under the current iteration times and the Actor network under the current iteration times according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration times and the Actor network under the next iteration times.
And the target network updating module is used for updating the Critic target network under the current iteration number and the Actor target network under the current iteration number according to the Critic network under the next iteration number and the Actor network under the next iteration number, to obtain the Critic target network and the Actor target network under the next iteration; the iteration number is then updated to enter the next iteration, until the robot simulation motion model collides or reaches the preset training operation point, and iteration stops when the set number of iterations is reached, yielding the initial control model.
As an optional implementation manner, the network update module specifically includes:
and the Critic network loss function calculation unit is used for obtaining the Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value.
And the Critic network updating unit is used for training the Critic network under the current iteration number according to the Critic network loss function under the current iteration number to obtain the Critic network under the next iteration number.
And the Actor network loss function calculating unit is used for obtaining the Actor network loss function under the current iteration times according to the estimated Q value.
And the Actor network updating unit is used for updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
As an optional implementation manner, the target network updating module specifically includes:
and the Critic target network updating unit is used for updating the Critic target network under the current iteration times according to the Critic network under the next iteration times to be used as the Critic target network under the next iteration.
And the Actor target network updating unit is used for updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to be used as the Actor target network under the next iteration times.
The invention has the following technical effects:
(1) According to the reinforcement-learning-based control method of the seabed stratum space robot, the robot autonomously plans its motion to a preset operation point once that point is input. The autonomously planned obstacle-avoidance motion path enables the seabed stratum space robot to avoid obstacle areas while keeping the path length shortest.
(2) The control method provided by the invention is designed around the motion characteristics of the submarine stratum space robot, the seabed stratum environment and the reinforcement learning algorithm, and achieves a strong technical effect in controlling the motion of the submarine stratum space robot.
(3) The control method of the submarine stratum space robot provided by the invention can also control various other types of robots for submarine stratum space operation, and has good transportability.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (8)

1. A method of controlling a subsea stratigraphic space robot, comprising:
step 1, presetting a preset operation point;
step 2, acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot;
step 3, obtaining the action at the next moment according to the preset operation point, the state at the current moment and the control model at the current moment, wherein the action at the next moment is used for controlling the motion of the submarine stratum space robot; the actions comprise control instructions of each driving hydraulic cylinder in the submarine drilling robot; the control model at the current moment is obtained by updating the control model at the previous moment based on a DDPG algorithm;
and 4, repeatedly executing the step 2 to the step 3 until the seabed stratum space robot moves to the preset operation point.
2. The control method of the submarine stratum space robot according to claim 1, wherein the specific training process of the initial control model is as follows:
constructing a simulation environment model; the simulation environment model comprises a submarine stratum environment simulation model and a robot simulation motion model;
setting a training preset operation point;
initializing a Critic network, an Actor network, a Critic target network and an Actor target network;
under the current iteration times, determining the current state of the robot simulation motion model according to the simulation environment model;
obtaining the action of the robot simulation motion model at the current moment according to the training preset operation point, the state of the robot simulation motion model at the current moment and an Actor network under the current iteration times;
the robot simulation motion model executes the action of the robot simulation motion model at the current moment to obtain behavior reward and the state of the next moment;
inputting the state of the robot simulation motion model at the next moment into an Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment;
inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into a Critic network under the current iteration number to obtain an estimated Q value;
obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward;
updating the Critic network under the current iteration times and the Actor network under the current iteration times according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration times and the Actor network under the next iteration times;
and updating the Critic target network under the current iteration number and the Actor target network under the current iteration number according to the Critic network under the next iteration number and the Actor network under the next iteration number to obtain the Critic target network under the next iteration number and the Actor target network under the next iteration number, updating the iteration number to enter the next iteration until the robot simulation motion model collides or a preset training operation point is reached, and stopping the iteration until a set iteration number is reached to obtain the initial control model.
3. The method for controlling a submarine stratum space robot according to claim 2, wherein the step of updating the Critic network for the current iteration count and the Actor network for the current iteration count according to the actual Q value and the estimated Q value to obtain the Critic network for the next iteration count and the Actor network for the next iteration count specifically comprises:
obtaining a Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value;
training the Critic network under the current iteration times according to the Critic network loss function under the current iteration times to obtain the Critic network under the next iteration times;
obtaining an Actor network loss function under the current iteration times according to the estimated Q value;
and updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
4. The method for controlling a submarine stratigraphic space robot according to claim 2, wherein the updating of the Critic target network under the current iteration number and the Actor target network under the current iteration number according to the Critic network under the next iteration number and the Actor network under the next iteration number to obtain the Critic target network under the next iteration and the Actor target network under the next iteration specifically comprises:
updating the Critic target network under the current iteration times according to the Critic network under the next iteration times to be used as the Critic target network under the next iteration;
and updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to serve as the Actor target network under the next iteration.
5. A control system for a subsea stratigraphic space robot, comprising:
the presetting module is used for presetting a preset operation point;
the acquisition module is used for acquiring the current time state of the submarine stratum space robot to be controlled; the states comprise attitude information, positioning information and resistance in the motion process of the submarine drilling robot and the working states of all driving hydraulic cylinders in the submarine drilling robot;
the control module is used for obtaining the action at the next moment according to the preset operation point, the state at the current moment and the control model at the current moment, and the action at the next moment is used for controlling the motion of the submarine stratum space robot; the actions include control commands for each drive hydraulic cylinder inside the subsea drilling robot; the control model at the current moment is obtained by updating the control model at the last moment based on a DDPG algorithm;
and the execution module is used for repeatedly executing the acquisition module and the control module until the seabed stratum space robot moves to the preset operation point.
6. The control system of a subsea stratigraphic space robot according to claim 5, characterized by further comprising:
the building module is used for building a simulation environment model; the simulation environment model comprises a seabed stratum environment simulation model and a robot simulation motion model;
the second preset module is used for setting a preset training operation point;
the network initialization module is used for initializing a Critic network, an Actor network, a Critic target network and an Actor target network;
the current state determining module is used for determining the current state of the robot simulation motion model according to the simulation environment model under the current iteration times;
the current action determining module is used for obtaining the action of the robot simulation motion model at the current moment according to the preset training operation point, the state of the robot simulation motion model at the current moment and an Actor network under the current iteration times;
the next moment state determining module is used for the robot simulation motion model to execute the action of the robot simulation motion model at the current moment so as to obtain behavior reward and the state of the next moment;
the next moment action determining module is used for inputting the state of the robot simulation motion model at the next moment into an Actor target network under the current iteration times to obtain the action of the robot simulation motion model at the next moment;
the estimated Q value calculation module is used for inputting the state of the robot simulation motion model at the current moment and the action of the robot simulation motion model at the current moment into a Critic network under the current iteration number to obtain an estimated Q value;
the actual Q value calculation module is used for obtaining an actual Q value according to the action of the robot simulation motion model at the next moment, the state of the robot simulation motion model at the next moment, the Critic target network under the current iteration number and the behavior reward;
the network updating module is used for updating the Critic network under the current iteration times and the Actor network under the current iteration times according to the actual Q value and the estimated Q value to obtain the Critic network under the next iteration times and the Actor network under the next iteration times;
and the target network updating module is used for updating the Critic target network under the current iteration number and the Actor target network under the current iteration number according to the Critic network under the next iteration number and the Actor network under the next iteration number to obtain the Critic target network under the next iteration and the Actor target network under the next iteration, and updating the iteration number to enter the next iteration until the robot simulation motion model collides or reaches a preset training operation point, and stopping the iteration until the set iteration number is reached to obtain the initial control model.
7. The control system of a subsea stratigraphic space robot according to claim 6, characterized in that said network update module comprises in particular:
the Critic network loss function calculation unit is used for obtaining a Critic network loss function under the current iteration number according to the actual Q value and the estimated Q value;
the Critic network updating unit is used for training the Critic network under the current iteration times according to the Critic network loss function under the current iteration times to obtain the Critic network under the next iteration times;
the Actor network loss function calculation unit is used for obtaining an Actor network loss function under the current iteration times according to the estimated Q value;
and the Actor network updating unit is used for updating the Actor network under the current iteration times according to the Actor network loss function under the current iteration times.
8. The control system of a subsea stratigraphic space robot according to claim 6, characterized in that said target network update module comprises in particular:
the Critic target network updating unit is used for updating the Critic target network under the current iteration times according to the Critic network under the next iteration times to be used as the Critic target network under the next iteration;
and the Actor target network updating unit is used for updating the Actor target network under the current iteration times according to the Actor network under the next iteration times to be used as the Actor target network under the next iteration times.
CN202210623726.5A 2022-06-02 2022-06-02 Control method and system for submarine stratum space robot Active CN114771783B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210623726.5A CN114771783B (en) 2022-06-02 2022-06-02 Control method and system for submarine stratum space robot

Publications (2)

Publication Number Publication Date
CN114771783A true CN114771783A (en) 2022-07-22
CN114771783B CN114771783B (en) 2023-08-22

Family

ID=82420784

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210623726.5A Active CN114771783B (en) 2022-06-02 2022-06-02 Control method and system for submarine stratum space robot

Country Status (1)

Country Link
CN (1) CN114771783B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006065703A (en) * 2004-08-30 2006-03-09 Inst Of Systems Information Technologies Kyushu Self-position estimation device, self-position estimation method, program capable of executing self-position estimation method by computer, and recording medium recording program
WO2015050884A1 (en) * 2013-10-03 2015-04-09 Halliburton Energy Services, Inc. Multi-layer sensors for downhole inspection
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN113236223A (en) * 2021-06-10 2021-08-10 安徽理工大学 Intelligent design system and method for coal mine underground gas prevention and control drilling
CN114035568A (en) * 2021-03-27 2022-02-11 浙江大学 Method for planning path of stratum drilling robot in combustible ice trial production area
CN114563011A (en) * 2022-01-24 2022-05-31 北京大学 Active auditory localization method for map-free navigation


Also Published As

Publication number Publication date
CN114771783B (en) 2023-08-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant