CN116734850A - Unmanned platform reinforcement learning autonomous navigation system and method based on visual input - Google Patents

Unmanned platform reinforcement learning autonomous navigation system and method based on visual input

Info

Publication number
CN116734850A
Authority
CN
China
Prior art keywords
module
unmanned platform
reinforcement learning
visual
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310458355.4A
Other languages
Chinese (zh)
Inventor
李震
白正琨
陈振
刘向东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310458355.4A priority Critical patent/CN116734850A/en
Publication of CN116734850A publication Critical patent/CN116734850A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G01 - MEASURING; TESTING
    • G01C - MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00 - Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20 - Instruments for performing navigational calculations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/092 - Reinforcement learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Automation & Control Theory (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention belongs to the technical field of unmanned platform autonomous navigation, and in particular relates to an unmanned platform reinforcement learning autonomous navigation system and method based on visual input. The invention designs a complete vision-based unmanned platform autonomous navigation system, covering environmental data acquisition and environmental feature extraction through to the path planning algorithm, and designs dedicated motion state and visual feature extraction networks so that the reinforcement learning decision network obtains comprehensive, rich state information to promote network learning. Meanwhile, the action-adjustment-based reinforcement learning navigation algorithm designed by the invention introduces a heuristic controller and an action adjustment network together into the baseline PPO algorithm; this helps reinforcement learning training converge faster, avoids falling into local optima or even failing to converge, and improves the overall efficiency of the algorithm. Overall, the navigation system provided by the invention can navigate autonomously without prior environmental information and has a degree of generalization and transferability.

Description

Unmanned platform reinforcement learning autonomous navigation system and method based on visual input
Technical Field
The invention belongs to the technical field of unmanned platform autonomous navigation, and particularly relates to an unmanned platform reinforcement learning autonomous navigation system and method based on visual input.
Background
Navigation technology is the basis of many unmanned platform applications; only once the navigation problem is solved can an unmanned platform execute more complex and demanding tasks. For an unmanned platform, the navigation problem can be briefly described as follows: starting from its current position, the platform perceives the environment and plans a collision-free path to move to the target position. In navigation tasks, the information the unmanned platform chiefly needs to perceive from the environment is its own position within that environment. The Global Positioning System (GPS) can provide position information for an unmanned platform, but its signal is weak indoors, so the positioning accuracy is low and it cannot serve as the input of a navigation algorithm. In addition, conventional unmanned platform path planning methods are commonly based on planning over a priori maps, for example search algorithms such as depth-first search (DFS) and breadth-first search (BFS). A search algorithm finds a traversable path from the start point to the target point by means of known map information and obstacle information in the environment. These algorithms require prior knowledge of the environment to work and struggle with complex scenes.
Positioning technology based on binocular vision can provide the unmanned platform's position and simultaneously map the environment without prior environmental information, and navigation can be completed on this basis. It can be used in environments where the GPS signal is weak or map information is lacking, such as indoors; it is a current research hotspot and has already been applied in fields such as household robots and unmanned robots. The remarkable ability of machine learning in visual perception offers great application potential for unmanned platform navigation. In indoor navigation of unmanned platforms, processing image perception information through neural networks and extracting image features can help the unmanned platform understand the environment more deeply and improve navigation performance. The navigation problem can be abstracted as a sequential decision problem: the unmanned platform moves from the current state to the next state by performing an action, and the process repeats until the target point is reached. Such problems can be solved by reinforcement learning (RL).
Therefore, applying reinforcement learning based on visual input to the unmanned platform navigation task allows the unmanned platform to explore and interact with the environment autonomously and to make decisions without human intervention in order to complete navigation. It also allows the unmanned platform to adapt to more complex environments and increases its task capability across different environments.
Disclosure of Invention
The technical solution of the invention is as follows: an unmanned platform reinforcement learning autonomous navigation system and method based on visual input are provided. The system and method form a complete pipeline, from acquisition of the unmanned platform's position information and extraction of sensor data features to reinforcement-learning path planning and navigation, and are used for simulated indoor navigation tasks.
The technical scheme of the invention is as follows: an unmanned platform reinforcement learning autonomous navigation system based on visual input.
The autonomous navigation system comprises an environment simulation module, a visual perception module and a reinforcement learning module;
the environment simulation module is used for outputting speed information, RGB image information and depth image information of the simulated unmanned platform to the reinforcement learning module; the environment simulation module is also used for outputting visual images of the binocular camera to the visual perception module;
the visual perception module is used for receiving the visual image of the binocular camera output by the environment simulation module, obtaining the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera, and outputting the relative position to the reinforcement learning module;
the reinforcement learning module is used for receiving the speed information, the RGB image information and the depth image information of the unmanned platform output by the environment simulation module and the relative position of the unmanned platform under the world coordinate system output by the visual perception module; and the reinforcement learning module outputs actions of the unmanned platform to the unmanned platform in the environment simulation module according to the received speed information, RGB image information, depth image information and relative positions of the unmanned platform under the world coordinate system.
Specifically, the environment simulation module is a simulation environment formed by a UE4 engine and an AirSim plug-in, wherein the UE4 engine is responsible for building and rendering the simulation environment required for the unmanned platform's actions, and AirSim is responsible for importing simulation models of unmanned platforms such as quadrotors and unmanned ground vehicles; the environment simulation module provides a sensor interface and a control interface for the unmanned platform: through the sensor interface it provides the unmanned platform's speed information, RGB image information and aligned depth image information to the reinforcement learning module, and control signals for the unmanned platform's actions are transmitted to the unmanned platform in the simulation environment through the control interface, completing the simulated actions of the unmanned platform; visual images from the unmanned platform's binocular camera in the simulation environment are transmitted to the visual perception module through ROS topics.
Specifically, the visual perception module first performs image preprocessing on the received visual images of the binocular camera, then extracts feature points from the preprocessed visual images, and finally inputs the extracted feature points into the visual odometer for calculation to obtain the real-time position of the unmanned platform, while the obtained real-time position of the unmanned platform is corrected by using the local map.
Specifically, the visual perception module is built based on ROS, relative position information of the unmanned platform under the world coordinate system, calculated by the visual perception module, is released to the reinforcement learning module through ROS topics to serve as a part of navigation algorithm input to make a decision, and meanwhile, the visual perception module establishes a service end based on an ROS service mechanism and is used for triggering a reset function in a simulation process, namely, zero setting of a departure point, map clearing and key frame clearing operation.
The method for the visual perception module to obtain the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera comprises the following steps:
step S1: acquiring a visual image of the binocular camera through the ROS;
step S2: preprocessing the visual image;
step S3: inputting the preprocessed visual image into a visual odometer for calculation to obtain the preliminary real-time position of the unmanned platform
Step S4: updating a local map and optimizing preliminary real-time location by information of the local map
Step S5: transmitting the optimized real-time position to the reinforcement learning module through ROS
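For illustration, the per-frame front end of steps S1 to S4 could look like the minimal sketch below; the ORB feature extractor is an assumed choice, and the visual odometer and the local-map correction are represented by hypothetical placeholder objects rather than the actual implementation.

```python
import cv2

def preprocess(raw_bgr):
    """Step S2: convert to grayscale and equalize the histogram (assumed preprocessing)."""
    gray = cv2.cvtColor(raw_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.equalizeHist(gray)

orb = cv2.ORB_create(nfeatures=1000)

def extract_features(img):
    """Extract feature points (keypoints and descriptors) from a preprocessed image."""
    return orb.detectAndCompute(img, None)

def track_frame(left_bgr, right_bgr, visual_odometer, local_map):
    """Steps S3-S4: feed stereo features to the visual odometer, then refine with the local map."""
    kp_l, des_l = extract_features(preprocess(left_bgr))
    kp_r, des_r = extract_features(preprocess(right_bgr))
    pose = visual_odometer.estimate(kp_l, des_l, kp_r, des_r)   # hypothetical VO interface
    return local_map.refine(pose)                               # hypothetical local-map correction
```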
Specifically, the reinforcement learning module consists of a state feature extraction module responsible for extracting state features and a deep reinforcement learning decision module responsible for path planning; when the reinforcement learning training needs to be restarted, the reinforcement learning module serves as a client of the ROS service to send a reset request to a service end of the visual perception module, and the reinforcement learning module and the visual perception module are reset at the same time.
The state feature extraction module receives and processes original information from the environment simulation module and the visual perception module, wherein the original information comprises RGB image and depth image information given by the environment simulation module, speed information of the unmanned platform, real-time position information of the unmanned platform given by the visual perception module and time sequence information generated in the navigation training process; the state feature extraction module is used for respectively processing and carrying out feature aggregation on the information to obtain final unmanned platform state features and outputting the final unmanned platform state features to the deep reinforcement learning decision module;
the deep reinforcement learning decision module is used for receiving the unmanned platform state characteristics output by the state characteristic extraction module, outputting the optimal actions of the unmanned platform under the current state characteristics according to the received unmanned platform state characteristics, and transmitting the optimal actions to the unmanned platform in the environment simulation module for execution. Important parts in the deep reinforcement learning decision module include action set design, reward function design and algorithm architecture design.
The state feature extraction module in the reinforcement learning module consists of a visual information processing sub-module, a motion information processing sub-module and a time sequence feature extraction sub-module;
The visual information processing sub-module is used for receiving the 3-channel RGB image and the aligned depth image of the unmanned platform output by the environment simulation module; finally obtaining the visual characteristics of the unmanned platform after processing;
the motion information processing sub-module receives the real-time position of the unmanned platform calculated by the visual perception module and the speed information of the unmanned platform output by the environment simulation module; finally obtaining the motion characteristics of the unmanned platform after processing;
the time sequence feature extraction sub-module combines the action information and reward information of the previous period of reinforcement learning with the visual features and motion features of the unmanned platform to obtain the final unmanned platform state features;
the method for processing the received original information by the state feature extraction module comprises the following steps:
step S1: the visual information processing sub-module receives the RGB image and the depth image transmitted from the environment simulation module;
step S2: preprocessing the RGB image according to the sequence of range scaling, size adjustment and standardization;
step S3: preprocessing the depth image according to the sequence of range scaling, size adjustment and standardization;
Step S4: extracting 1×512 RGB features from the preprocessed RGB image by using a ResNet-50 network;
step S5: extracting 1X 512 depth features from the preprocessed depth image by adopting a ResNet-50 network;
step S6: performing feature aggregation on the RGB features and the depth features to obtain a 1×1024 feature;
step S7: mapping the 1×1024 feature to 1×512 dimensions through the full connection layer to obtain the visual features of the unmanned platform;
step S9: the motion information processing sub-module receives the real-time position of the unmanned platform calculated by the visual perception module and the speed information of the unmanned platform output by the environment simulation module;
step S10: calculating real-time position information of the unmanned aerial vehicle platform and navigation target point information to obtain relative position information of the unmanned aerial vehicle platform relative to the navigation target point under the NED coordinate system;
step S11: calculating the linear speed of the unmanned platform under the unmanned platform body coordinate system and the relative position of the unmanned platform relative to the navigation target point under the unmanned platform body coordinate system;
step S12: performing feature mapping on the linear speed and the relative position information through a multi-layer perceptron, and mapping from 6 dimensions to 128 dimensions to obtain the motion features of the unmanned platform;
step S13: the time sequence feature extraction submodule receives the visual feature with the size of 1×512 output by the visual feature extraction submodule and the motion feature with the size of 1×128 output by the motion feature extraction submodule;
Step S14: the visual features and the motion features are aggregated to obtain a vector with the size of 1×640;
step S15: mapping the 1×640 vector obtained in step S14 to 1×512 through a full connection layer to obtain the motion state characteristics of the current state of the unmanned platform;
step S16: the state characteristics of the unmanned platform at the current moment, the previous-period action information and the previous-period reward information are input into an LSTM network together for extraction, finally obtaining the 1×512 state characteristics of the unmanned platform;
the deep reinforcement learning decision module outputs the corresponding unmanned platform actions after making decisions according to the final 1×512 state characteristics;
the action set in the deep reinforcement learning decision module comprises unidirectional motions forward, backward, left and right, oblique motions to the front-left and front-right, and counterclockwise and clockwise rotations, a discrete action set of eight actions in total;
the reward function in the deep reinforcement learning decision module, namely the reward function r in the reinforcement learning PPO controller algorithm, consists of four parts: an explicit reward r_f, a collision reward r_c, a step reward r_s and a distance reward r_d;
the algorithm architecture in the deep reinforcement learning decision module is designed in such a way that the current state of the unmanned platform is simultaneously input into the PPO network and the heuristic controller, the PPO network and the heuristic controller respectively output reinforcement learning actions and the heuristic actions, the action adjustment network makes decisions according to the current state, reinforcement learning actions or heuristic actions are executed, an intelligent agent executes the decisions, the actually executed actions and the decisions are sent into an experience pool, and then the PPO network and the action adjustment network are updated;
The algorithm steps of the deep reinforcement learning decision module are as follows:
step S1: initializing the action adjustment network parameters;
Step S2: initializing the PPO policy network parameters and the value function network parameters φ_0;
Step S3: initializing an experience playback pool R;
step S4: resetting the environment and unmanned platform states, and resetting the experience playback pool;
step S5: obtaining the initial state s_0 of the unmanned platform and the target point position p_d;
step S6: sampling the PPO policy network to output action a_t^0;
step S7: sampling the heuristic controller to output action a_t^1;
step S8: sampling the action adjustment network to output action a_t ∈ {a_t^0, a_t^1};
step S9: executing action a_t to obtain the reward r_{t+1} and the next state s_{t+1};
step S10: judging whether s_{t+1} - p_d approaches 0; if so, let d_{t+1} = 1; if not, let d_{t+1} = 0;
step S11: storing (s_t, a_t, r_{t+1}, s_{t+1}, d_{t+1}) into R;
step S12: judging whether the current moment is T-1 or not, and executing a step S13 if the current moment is T-1; if not, executing step S6;
step S13: randomly sampling a minibatch-sized batch of samples from R;
step S14: calculating the cumulative discounted return and calculating the advantage function;
Step S15: updating the PPO network and updating the action adjustment network;
step S16: judging whether the current sampling number is M, if so, executing a step S17; if not, executing step S13;
Step S17: judging whether the K-th episode has currently been reached; if so, ending the algorithm; if not, executing step S4.
An unmanned platform reinforcement learning autonomous navigation method based on visual input, comprising the following steps:
the method comprises the steps that firstly, an environment simulation module outputs speed information, RGB image information and depth image information of a simulation unmanned platform to a reinforcement learning module;
the second step, the environment simulation module outputs the visual image of the binocular camera to the visual perception module;
thirdly, the visual perception module receives the visual image of the binocular camera output by the environment simulation module, obtains the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera, and outputs the relative position to the reinforcement learning module;
and fourthly, the reinforcement learning module receives the speed information, the RGB image information and the depth image information of the unmanned platform output by the environment simulation module and the relative position of the unmanned platform under the world coordinate system output by the visual perception module, and outputs the actions of the unmanned platform to the unmanned platform in the environment simulation module according to the received speed information, the RGB image information, the depth image information and the relative position of the unmanned platform under the world coordinate system.
Advantageous effects
(1) The invention realizes a complete unmanned platform autonomous navigation system based on visual input, which is realized from environmental data acquisition, environmental characteristic extraction to path planning algorithm;
(2) The positioning based on visual input, namely the visual perception module, better matches the situation in which a real unmanned platform localizes itself through vision rather than reading a ground-truth position directly from an interface of the simulation environment, and is therefore better suited to transfer to real experiments;
(3) The state feature extraction module is provided with the motion state feature extraction network and the visual feature extraction network, so that the reinforcement learning decision network can obtain comprehensive and rich state information to promote network learning;
(4) The reinforcement learning navigation algorithm based on action adjustment in the deep reinforcement learning decision module introduces the heuristic controller and the action adjustment network together into the baseline PPO algorithm; this helps reinforcement learning training converge faster, avoids falling into local optima or even failing to converge, and improves the overall efficiency of the algorithm;
(5) The system can perform autonomous navigation without prior environmental information and has a degree of generalization and transferability.
(6) The invention belongs to the technical field of unmanned platform autonomous navigation, and in particular relates to an unmanned platform reinforcement learning autonomous navigation system and method based on visual input. The invention designs a complete vision-based unmanned platform autonomous navigation system covering environmental data acquisition and environmental feature extraction through to the path planning algorithm; the positioning based on visual input better matches the situation in which a real unmanned platform localizes itself through vision, rather than reading a ground-truth position directly from an interface of the simulation environment, and is therefore better suited to transfer to real experiments. In addition, the invention designs dedicated motion state and visual feature extraction networks, so that the reinforcement learning decision network can obtain comprehensive and rich state information to promote network learning. Meanwhile, the action-adjustment-based reinforcement learning navigation algorithm designed by the invention introduces the heuristic controller and the action adjustment network together into the baseline PPO algorithm, which helps reinforcement learning training converge faster, avoids falling into local optima or even failing to converge, and improves the overall efficiency of the algorithm. Overall, the navigation system provided by the invention can navigate autonomously without prior environmental information and has a degree of generalization and transferability.
Drawings
FIG. 1 is a diagram of a reinforcement learning navigation system framework based on visual input in accordance with the present application;
FIG. 2 is a schematic diagram of a reinforcement learning module shown in FIG. 1 according to the present application;
FIG. 3 is a more specific embodiment of the reinforcement learning module of FIG. 1 of the present application;
FIG. 4 is a specific flow of the "visual information processing sub-module" of FIG. 3;
FIG. 5 is a specific flow of the "motion information processing submodule" of FIG. 3;
FIG. 6 is a specific flow of the "timing feature extraction submodule" in FIG. 3;
FIG. 7 is a specific flow of the "deep reinforcement learning decision module" of FIG. 3;
FIG. 8 is an explanatory diagram of the action set design in the "deep reinforcement learning decision module" of FIG. 2;
fig. 9 is a specific flow of the "action adjusting network" in fig. 2.
Detailed Description
For a clearer and more complete description of the objects, technical solutions and advantages of the present application, reference will be made to the following detailed description taken in conjunction with the accompanying drawings and examples.
As shown in fig. 1, an unmanned platform reinforcement learning autonomous navigation system based on visual input comprises an environment simulation module, a visual perception module and a reinforcement learning module; in the figure, solid arrows represent the information transfer direction, and elliptical shapes represent the information content to be transferred.
The environment simulation module is a simulation environment formed by a UE4 engine and an AirSim plug-in, wherein the UE4 engine is responsible for building and rendering the simulation environment required for the unmanned platform's actions, and AirSim is responsible for importing simulation models of unmanned platforms such as quadrotors and unmanned ground vehicles; the environment simulation module provides a sensor interface and a control interface for the unmanned platform: it provides the simulated unmanned platform's speed information, RGB image information and aligned depth image information to the reinforcement learning module through the sensor interface (APIs), and after the reinforcement learning module completes its decision, the control signal for the unmanned platform's action is transmitted to the unmanned platform in the simulation environment through the control interface (APIs) of the environment simulation module, completing the simulated action of the unmanned platform; the simulated unmanned platform in the environment simulation module acts as a working Node of the ROS and outputs visual images of the binocular camera, which are transmitted to the visual perception module through ROS topics;
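As an illustration of the sensor and control interfaces described above, a minimal sketch using the public AirSim Python client might look as follows; the camera name "front_center", the image types and the velocity-based control call are assumptions that depend on the simulation settings rather than details taken from the patent.

```python
import airsim

client = airsim.MultirotorClient()
client.confirmConnection()
client.enableApiControl(True)

def read_sensors():
    """Sensor interface: platform velocity, RGB image and aligned depth image."""
    state = client.getMultirotorState()
    velocity = state.kinematics_estimated.linear_velocity      # NED frame
    responses = client.simGetImages([
        airsim.ImageRequest("front_center", airsim.ImageType.Scene, False, False),
        airsim.ImageRequest("front_center", airsim.ImageType.DepthPlanar, True),
    ])
    return velocity, responses[0], responses[1]

def apply_translation(vx, vy, duration=1.0):
    """Control interface: execute one translational action of the discrete action set."""
    client.moveByVelocityAsync(vx, vy, 0.0, duration).join()
```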
the visual perception module is built based on the ROS, and is used for receiving the visual image of the binocular camera output by the environment simulation module, carrying out image preprocessing on the received visual image of the binocular camera, extracting characteristic points of the preprocessed visual image, inputting the extracted characteristic points of the visual image into the visual odometer for calculation to obtain the real-time position of the unmanned platform, and correcting the obtained real-time position of the unmanned platform by using the local map to obtain the relative position of the unmanned platform under the world coordinate system; the relative position information of the unmanned platform under the world coordinate system is released to the reinforcement learning module through the ROS topic and is used as a part of navigation algorithm input to make a decision; meanwhile, the visual perception module establishes a Service end (Service) based on an ROS Service mechanism, and is used for triggering a reset function in the simulation process, namely, the operations of starting point zero setting, map clearing, key frame clearing and the like;
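A minimal ROS-side sketch of these interfaces is given below; the topic and service names are illustrative assumptions, and the reset operations are only indicated as comments because they depend on the SLAM back end.

```python
import rospy
from geometry_msgs.msg import PoseStamped
from std_srvs.srv import Empty, EmptyResponse

rospy.init_node("visual_perception")
pose_pub = rospy.Publisher("/uav/relative_pose", PoseStamped, queue_size=1)

def handle_reset(_req):
    """Reset service: zero the start point, clear the local map and the key frames."""
    # These operations depend on the SLAM back end and are only indicated here.
    return EmptyResponse()

reset_srv = rospy.Service("/visual_perception/reset", Empty, handle_reset)

def publish_pose(position, orientation):
    """Publish the platform's pose in the world frame to the reinforcement learning module."""
    msg = PoseStamped()
    msg.header.stamp = rospy.Time.now()
    msg.pose.position.x, msg.pose.position.y, msg.pose.position.z = position
    (msg.pose.orientation.x, msg.pose.orientation.y,
     msg.pose.orientation.z, msg.pose.orientation.w) = orientation
    pose_pub.publish(msg)
```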
The reinforcement learning module specifically comprises a state feature extraction module responsible for extracting state features and a deep reinforcement learning decision module responsible for path planning, as shown in fig. 2; the RGB image, the depth image and the unmanned platform speed information are sensor information from an environment simulation module, and the position information is the real-time position of the unmanned platform from a visual perception module; the reinforcement learning module outputs the action of the unmanned platform; when training needs to be restarted, the reinforcement learning module serves as a client of the ROS service to send a reset request to a server of the visual perception module, and the reinforcement learning module and the visual perception module are reset at the same time;
the state feature extraction module receives and processes original information from the environment simulation module and the visual perception module, wherein the original information comprises RGB image and depth image information given by the environment simulation module, speed information of the unmanned platform, real-time position information of the unmanned platform given by the visual perception module and time sequence information generated in the navigation training process; the state feature extraction module is used for respectively processing and carrying out feature aggregation on the information to obtain final unmanned platform state features and outputting the final unmanned platform state features to the deep reinforcement learning decision module;
The deep reinforcement learning decision module is the core of the system and is used for receiving the unmanned platform state characteristics output by the state characteristic extraction module, outputting the optimal actions of the unmanned platform under the current state characteristics according to the received unmanned platform state characteristics, and transmitting the optimal actions to the unmanned platform in the environment simulation module for execution; important parts in the deep reinforcement learning decision module include action set design, reward function design and algorithm architecture design.
The state feature extraction module consists of a visual information processing sub-module, a motion information processing sub-module and a time sequence feature extraction sub-module, as shown in figure 3;
the visual information processing sub-module is used for receiving the 3-channel RGB image and the aligned depth image of the unmanned platform output by the environment simulation module, as shown in fig. 4; after the RGB image is preprocessed by range scaling, size adjustment and standardization, a ResNet-50 network is adopted to extract features and output RGB features with the shape of 1×512; after the depth image is preprocessed by range scaling, size adjustment and standardization, a ResNet-50 network is adopted to extract features and output depth features with the shape of 1×512; then, feature aggregation is performed on the 1×512 RGB features and the 1×512 depth features to obtain 1×1024 visual features, and the 1×1024 visual features are mapped to 1×512 dimensions through a full connection layer to obtain the visual features of the unmanned platform;
Still further, range scaling refers to scaling the data range of the original image to [0,1]; resizing refers to resizing the picture to 3×224×224; the normalization of the image proceeds as follows: for each input channel, the original value minus the mean, divided by the standard deviation (std), is the processed result;
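Under the sizes stated above, the visual branch could be sketched as follows; the normalization statistics, the replacement of the ResNet-50 classification layer by a 512-dimensional linear layer, and the replication of the single-channel depth image to three channels are assumptions, not details given in the patent.

```python
import torch
import torch.nn as nn
from torchvision import models, transforms

# Preprocessing: range scaling to [0, 1], resize, per-channel (value - mean) / std.
# The mean/std values below are assumed (ImageNet statistics).
preprocess = transforms.Compose([
    transforms.ToTensor(),                              # range scaling to [0, 1]
    transforms.Resize((224, 224)),                      # size adjustment
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),    # standardization per channel
])

class VisualBranch(nn.Module):
    """RGB and depth ResNet-50 features (1x512 each) -> 1x1024 -> 1x512."""
    def __init__(self):
        super().__init__()
        self.rgb_backbone = models.resnet50()
        self.depth_backbone = models.resnet50()
        # Replacing the classification layer is one way to obtain a 1x512 feature.
        self.rgb_backbone.fc = nn.Linear(2048, 512)
        self.depth_backbone.fc = nn.Linear(2048, 512)
        self.fuse = nn.Linear(1024, 512)                # 1x1024 aggregated feature -> 1x512

    def forward(self, rgb, depth):
        # The depth image is assumed to be replicated to 3 channels beforehand.
        f_rgb = self.rgb_backbone(rgb)
        f_depth = self.depth_backbone(depth)
        return self.fuse(torch.cat([f_rgb, f_depth], dim=-1))
```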
The motion information processing sub-module, as shown in fig. 5, receives the real-time position of the unmanned platform calculated by the visual perception module and the speed information of the unmanned platform output by the environment simulation module, both in the NED coordinate system; the motion information processing sub-module computes, from the real-time position information of the unmanned platform and the navigation target point information, the relative position of the unmanned platform with respect to the navigation target point in the NED coordinate system, then uses the quaternion of the unmanned platform provided by the environment simulation module to obtain the Euler angles and, from them, a rotation matrix, and then calculates the linear velocity of the unmanned platform in the unmanned platform body coordinate system and the relative position of the unmanned platform with respect to the navigation target point in the unmanned platform body coordinate system; finally, the linear velocity and relative position information are feature-mapped by a Multi-Layer Perceptron (MLP) from 6 dimensions to 128 dimensions to obtain the motion features of the unmanned platform;
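A sketch of this motion branch under the stated 6-to-128 mapping is shown below; the quaternion convention (x, y, z, w), the sign convention of the relative position and the hidden width of the perceptron are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from scipy.spatial.transform import Rotation

def to_body_frame(quat_xyzw, vec_ned):
    """Express a NED-frame vector in the unmanned platform body frame."""
    world_to_body = Rotation.from_quat(quat_xyzw).as_matrix().T
    return world_to_body @ np.asarray(vec_ned)

# Multi-layer perceptron mapping the 6-dimensional motion vector to 128 dimensions.
motion_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 128))

def motion_feature(position_ned, target_ned, velocity_ned, quat_xyzw):
    rel_pos_body = to_body_frame(quat_xyzw, np.asarray(target_ned) - np.asarray(position_ned))
    vel_body = to_body_frame(quat_xyzw, velocity_ned)
    x = torch.as_tensor(np.concatenate([vel_body, rel_pos_body]), dtype=torch.float32)
    return motion_mlp(x)    # 1x128 motion feature
```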
The time sequence feature extraction sub-module, as shown in fig. 6, combines the action information and reward information of the previous period on the basis of the visual features and motion features to obtain the final unmanned platform state feature representation. Specifically: first, the sub-module receives the 1×512 visual feature output by the visual feature extraction sub-module and the 1×128 motion feature output by the motion feature extraction sub-module. Then, the two are aggregated into a 1×640 vector, which is mapped to 1×512 dimensions through a full connection layer to obtain the state features of the unmanned platform at the current moment. Because the unmanned platform adopts a discrete action set, the action information of the previous period is represented by one-hot encoding for convenience, with a single bit as the effective code; for example, with 8 actions in the action set, if the action taken in the previous period is the 4th, it is represented as a = [0,0,0,1,0,0,0,0]. Finally, the state features of the unmanned platform at the current moment, the previous-period action information and the previous-period reward information are input together into a Long Short-Term Memory (LSTM) network for extraction, finally obtaining the 1×512 state features of the unmanned platform.
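The aggregation just described could be sketched as follows; the LSTM hidden size and the exact way the previous action and reward are appended to the input are assumptions consistent with the sizes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimingFeature(nn.Module):
    """Fuse visual (1x512) and motion (1x128) features, append the previous action and
    reward, and extract a 1x512 state feature with an LSTM."""
    def __init__(self, num_actions=8, feat_dim=512):
        super().__init__()
        self.num_actions = num_actions
        self.fuse = nn.Linear(512 + 128, feat_dim)                      # 1x640 -> 1x512
        self.lstm = nn.LSTM(feat_dim + num_actions + 1, feat_dim, batch_first=True)

    def forward(self, visual_feat, motion_feat, prev_action, prev_reward, hidden=None):
        state = self.fuse(torch.cat([visual_feat, motion_feat], dim=-1))
        # One-hot encoding of the previous action, e.g. the 4th action -> [0,0,0,1,0,0,0,0].
        prev_a = F.one_hot(prev_action, num_classes=self.num_actions).float()
        x = torch.cat([state, prev_a, prev_reward], dim=-1).unsqueeze(1)
        out, hidden = self.lstm(x, hidden)
        return out.squeeze(1), hidden                                   # 1x512 state feature
```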
The deep reinforcement learning decision module, as shown in fig. 7, outputs the corresponding unmanned platform action after deciding according to the final 1×512-dimensional state features. The algorithm used in the deep reinforcement learning decision module is now described: the deep reinforcement learning algorithm is implemented with an Actor-Critic network; the heuristic controller outputs heuristic actions according to the current motion information of the unmanned platform; the Actor network, the Critic network and the action adjustment network all consist of fully connected layers, and the inputs of all three networks are the 1×512-dimensional state features output by the time sequence feature extraction module; the Actor network outputs a probability distribution over the actions in the action set, which is sampled to obtain the reinforcement learning action, and the Critic network outputs a state value estimate. The action adjustment network selects between the reinforcement learning action and the heuristic action, acting like a single-pole double-throw switch;
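A sketch of the three fully connected heads operating on the 1×512 state feature is given below; the intermediate layer widths are assumptions.

```python
import torch
import torch.nn as nn

class DecisionHeads(nn.Module):
    """Actor, Critic and action adjustment heads over the 1x512 state feature."""
    def __init__(self, feat_dim=512, num_actions=8):
        super().__init__()
        self.actor = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                   nn.Linear(256, num_actions))   # distribution over the action set
        self.critic = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 1))            # state value estimate
        self.adjust = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(),
                                    nn.Linear(256, 2))            # choose RL action vs. heuristic action

    def forward(self, state_feat):
        policy = torch.distributions.Categorical(logits=self.actor(state_feat))
        value = self.critic(state_feat)
        switch = torch.distributions.Categorical(logits=self.adjust(state_feat))
        return policy, value, switch
```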
Further, as shown in FIG. 8, the discrete action set of the unmanned platform in the deep reinforcement learning decision module contains motions along the x-axis and y-axis and rotations about the z-axis, eight actions in total; the motions comprise unidirectional motions forward, backward, left and right and oblique motions to the front-left and front-right, and the rotations comprise counterclockwise and clockwise rotation. The recommended speed of unidirectional motion is 1 m/s, the speed of the two oblique forward motions is v_x = v_y = 0.8 m/s, the duration of each linear motion is 1 s, and the rotational motions are performed at a fixed angular velocity for a duration of 0.5 s;
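The eight-action set could be encoded as follows, expressed in the body frame as (v_x, v_y, yaw rate, duration); the sign conventions and the rotation rate value are assumed placeholders, since the patent text does not state the angular velocity.

```python
import math

YAW_RATE = math.radians(45.0)   # assumed placeholder; the patent does not state this value

# Body-frame actions as (vx, vy) in m/s, yaw rate in rad/s, duration in s.
ACTIONS = {
    0: dict(vx=1.0,  vy=0.0,  yaw_rate=0.0,       duration=1.0),   # forward
    1: dict(vx=-1.0, vy=0.0,  yaw_rate=0.0,       duration=1.0),   # backward
    2: dict(vx=0.0,  vy=-1.0, yaw_rate=0.0,       duration=1.0),   # left
    3: dict(vx=0.0,  vy=1.0,  yaw_rate=0.0,       duration=1.0),   # right
    4: dict(vx=0.8,  vy=-0.8, yaw_rate=0.0,       duration=1.0),   # front-left oblique
    5: dict(vx=0.8,  vy=0.8,  yaw_rate=0.0,       duration=1.0),   # front-right oblique
    6: dict(vx=0.0,  vy=0.0,  yaw_rate=-YAW_RATE, duration=0.5),   # counterclockwise rotation
    7: dict(vx=0.0,  vy=0.0,  yaw_rate=YAW_RATE,  duration=0.5),   # clockwise rotation
}
```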
Further, the design of the reward function in the deep reinforcement learning decision module, i.e., in the reinforcement learning PPO controller algorithm, is described as follows: the reward function r for the target point navigation task in the invention consists of four parts, an explicit reward r_f, a collision reward r_c, a step reward r_s and a distance reward r_d. The explicit reward r_f is the reward for reaching the end point; let l denote the distance between the current agent and the target point, with the current position of the agent p_t = (x_t, y_t) and the target point position d = (d_x, d_y), so that l = √((x_t - d_x)^2 + (y_t - d_y)^2). The reward obtained for completing the task is r_f = k_f: when l is smaller than the judgment threshold ε, the navigation task is completed and the agent receives a reward of size k_f; otherwise r_f = 0. The collision reward reflects the need to avoid obstacles during navigation, so when the agent collides with an obstacle the collision penalty is r_c = -k_c, where k_c is a positive constant. The step reward expresses that the agent should complete the navigation task as quickly as possible, so an additional step-dependent penalty is designed: a small penalty term is incurred each time an action is executed, and the total penalty grows as the number of steps increases. Compared with a linear penalty, a logarithmic penalty is more practical, since there may be several slightly longer but still good paths; the penalty takes the form r_s = -k_s · log n_s, where k_s is a small positive constant and n_s is the number of steps taken so far in the round. The distance reward reflects that, empirically, the state value should be higher the closer the agent is to the target point: if after one step the unmanned platform is closer to the target point it obtains a small reward, and if it is farther from the target point it receives a small penalty, in the form r_d = k_d · l_Δ, where k_d is a small positive constant approaching 0 and l_Δ is the change in distance to the target point after the current action is executed; l_Δ > 0 means the distance has shortened and r_d is positive, so the unmanned platform obtains a positive reward, whereas l_Δ < 0 means the platform has moved away from the target and r_d is negative, so the unmanned platform incurs a penalty. Combining the four terms above, the reward function of the unmanned platform under the target point navigation task can be expressed as r = r_f + r_c + r_s + r_d.
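Putting the four terms together, the reward could be computed as in the sketch below; the constants k_f, k_c, k_s, k_d and the threshold ε are assumed example values.

```python
import math

# Assumed example constants; the patent only requires k_f, k_c > 0 and small k_s, k_d.
K_F, K_C, K_S, K_D, EPSILON = 10.0, 5.0, 0.01, 0.05, 0.5

def reward(dist_to_goal, prev_dist_to_goal, collided, step_count):
    r_f = K_F if dist_to_goal < EPSILON else 0.0        # explicit reward for reaching the target
    r_c = -K_C if collided else 0.0                     # collision penalty
    r_s = -K_S * math.log(max(step_count, 1))           # logarithmic step penalty
    l_delta = prev_dist_to_goal - dist_to_goal          # > 0 when the platform moved closer
    r_d = K_D * l_delta                                 # distance shaping term
    return r_f + r_c + r_s + r_d
```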
Further, as shown in fig. 9, the algorithm architecture in the deep reinforcement learning decision module is designed so that the current state of the unmanned platform is input to the PPO network and the heuristic controller at the same time; they output the reinforcement learning action and the heuristic action respectively; the action adjustment network decides, according to the current state, whether the reinforcement learning action or the heuristic action is executed; the agent executes that decision, the actually executed action and the decision are sent to the experience pool, and the PPO network and the action adjustment network are then updated;
Specifically, after traversing the discrete action set, the heuristic controller selects the action whose resulting next position is closest, in Euclidean distance, to the target point, thereby avoiding random, ineffective exploration and accelerating convergence of the reinforcement learning policy network. The learning and decision of the action adjustment network involve two actions, namely the reinforcement learning action output by the PPO controller and the action output by the heuristic controller; the action adjustment network must judge which action has the greater value in the current state and then execute the action with the greater value; by estimating these values, the action adjustment network learns whether the heuristic action or the reinforcement learning action is better in a given state. The action adjustment network is updated with a VPG algorithm. The overall navigation algorithm puts the action actually executed by the unmanned platform into the experience pool, so the experience pool stores the actions currently evaluated as optimal by the action adjustment network. The PPO controller is updated through the experience pool, and experience containing heuristic actions can help the PPO algorithm converge to the optimum faster than experience obtained through a random policy; as the value corresponding to the PPO network increases, the action adjustment network will gradually become biased toward adopting the actions output by the PPO controller.
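The heuristic controller described above could be sketched as follows; the one-step kinematic prediction is an assumed approximation used only to rank the candidate actions.

```python
import numpy as np

def heuristic_action(position, yaw, target, actions):
    """Return the index of the action whose predicted next position is closest to the target."""
    best_idx, best_dist = None, float("inf")
    for idx, a in actions.items():
        # Rotate the body-frame velocity into the world frame and integrate over the duration.
        c, s = np.cos(yaw), np.sin(yaw)
        v_world = np.array([c * a["vx"] - s * a["vy"], s * a["vx"] + c * a["vy"]])
        next_pos = np.asarray(position, dtype=float) + v_world * a["duration"]
        dist = np.linalg.norm(next_pos - np.asarray(target, dtype=float))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx
```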
The algorithm steps of the deep reinforcement learning decision module are as follows:
step S1: initializing the action adjustment network parameters;
Step S2: initializing the PPO policy network parameters and the value function network parameters φ_0;
Step S3: initializing an experience playback pool R;
step S4: resetting the environment and unmanned platform states, and resetting the experience playback pool;
step S5: obtaining the initial state s_0 of the unmanned platform and the target point position p_d;
step S6: sampling the PPO policy network to output action a_t^0;
step S7: sampling the heuristic controller to output action a_t^1;
step S8: sampling the action adjustment network to output action a_t ∈ {a_t^0, a_t^1};
step S9: executing action a_t to obtain the reward r_{t+1} and the next state s_{t+1};
step S10: judging whether s_{t+1} - p_d approaches 0; if so, let d_{t+1} = 1; if not, let d_{t+1} = 0;
step S11: storing (s_t, a_t, r_{t+1}, s_{t+1}, d_{t+1}) into R;
step S12: judging whether the current moment is T-1 or not, and executing a step S13 if the current moment is T-1; if not, executing step S6;
step S13: randomly sampling a minibatch-sized batch of samples from R;
step S14: calculating the cumulative discounted return and calculating the advantage function;
Step S15: updating the PPO network and updating the action adjustment network;
step S16: judging whether the current sampling number is M, if so, executing a step S17; if not, executing step S13;
Step S17: judging whether the K-th episode has currently been reached; if so, ending the algorithm; if not, executing step S4.
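The steps above can be summarized in the following training-loop sketch; env, ppo, adjuster and heuristic stand for the simulation interface, the PPO controller, the action adjustment network and the heuristic controller, and all of their method names are hypothetical.

```python
def train(env, ppo, adjuster, heuristic, episodes_K, horizon_T, updates_M):
    replay = []                                              # S3: experience replay pool R
    for episode in range(episodes_K):                        # S17: run K episodes
        replay.clear()                                       # S4: reset environment and pool
        s, p_d = env.reset()                                 # S5: initial state and target point
        for t in range(horizon_T):                           # S12: roll out until time T-1
            a_rl = ppo.sample_action(s)                      # S6: PPO policy action
            a_h = heuristic.action(s, p_d)                   # S7: heuristic action
            a = adjuster.choose(s, a_rl, a_h)                # S8: action adjustment network decides
            s_next, r = env.step(a)                          # S9: execute and observe
            d = 1 if env.distance(s_next, p_d) < 1e-3 else 0 # S10: done flag
            replay.append((s, a, r, s_next, d))              # S11: store the transition
            s = s_next
        for _ in range(updates_M):                           # S16: M sampling/update iterations
            batch = ppo.sample_minibatch(replay)             # S13: sample a minibatch
            returns, advantages = ppo.compute_returns(batch) # S14: discounted returns and advantages
            ppo.update(batch, returns, advantages)           # S15: update the PPO network
            adjuster.update(batch)                           # S15: update the action adjustment network
```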
An unmanned platform reinforcement learning autonomous navigation method based on visual input, comprising the following steps:
the method comprises the steps that firstly, an environment simulation module outputs speed information, RGB image information and depth image information of a simulation unmanned platform to a reinforcement learning module;
the second step, the environment simulation module outputs the visual image of the binocular camera to the visual perception module;
thirdly, the visual perception module receives the visual image of the binocular camera output by the environment simulation module, obtains the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera, and outputs the relative position to the reinforcement learning module;
and fourthly, the reinforcement learning module receives the speed information, the RGB image information and the depth image information of the unmanned platform output by the environment simulation module and the relative position of the unmanned platform under the world coordinate system output by the visual perception module, and outputs the actions of the unmanned platform to the unmanned platform in the environment simulation module according to the received speed information, the RGB image information, the depth image information and the relative position of the unmanned platform under the world coordinate system.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, such changes and modifications are also intended to be within the scope of the invention.

Claims (10)

1. An unmanned platform reinforcement learning autonomous navigation system based on visual input is characterized in that:
the autonomous navigation system comprises an environment simulation module, a visual perception module and a reinforcement learning module;
the environment simulation module is used for outputting speed information, RGB image information and depth image information of the simulation unmanned platform to the reinforcement learning module, and is also used for outputting visual images of the binocular camera to the visual perception module;
the visual perception module is used for receiving the visual image of the binocular camera output by the environment simulation module, obtaining the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera, and outputting the relative position to the reinforcement learning module;
the reinforcement learning module is used for receiving the speed information, the RGB image information and the depth image information of the unmanned platform output by the environment simulation module and the relative position of the unmanned platform under the world coordinate system output by the visual perception module, and outputting the actions of the unmanned platform to the unmanned platform in the environment simulation module according to the received speed information, the RGB image information, the depth image information and the relative position of the unmanned platform under the world coordinate system.
2. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 1, wherein:
the environment simulation module is a simulation environment formed by a UE4 engine and an AirSim plug-in, wherein the UE4 engine is responsible for building and rendering the simulation environment required for the unmanned platform's actions, and AirSim is responsible for importing simulation models of unmanned platforms such as quadrotors and unmanned ground vehicles; the environment simulation module provides a sensor interface and a control interface of the unmanned platform, the environment simulation module provides speed information, RGB image information and aligned depth image information of the unmanned platform to the reinforcement learning module through the sensor interface, and control signals of actions of the unmanned platform are transmitted to the unmanned platform in the simulation environment through the control interface of the environment simulation module, so that simulated actions of the unmanned platform are completed; visual images of the unmanned platform's binocular camera in the simulation environment are transmitted to the visual perception module through ROS topics.
3. An unmanned platform reinforcement learning autonomous navigation system based on visual input according to claim 1 or 2, wherein:
the visual perception module performs image preprocessing on the received visual image of the binocular camera, extracts characteristic points of the preprocessed visual image, inputs the extracted characteristic points of the visual image to the visual odometer for calculation, so as to obtain the real-time position of the unmanned platform, and corrects the obtained real-time position of the unmanned platform by using the local map;
The visual perception module is built based on the ROS, relative position information of the unmanned platform under the world coordinate system, calculated by the visual perception module, is released to the reinforcement learning module through the ROS topic to be used as a part of navigation algorithm input for decision making, and meanwhile, the visual perception module establishes a service end based on an ROS service mechanism and is used for triggering a reset function in a simulation process, namely zero starting point setting, map clearing and key frame clearing operation;
the method for the visual perception module to obtain the relative position of the unmanned platform under the world coordinate system according to the received visual image of the binocular camera comprises the following steps:
step S1: acquiring a visual image of the binocular camera through the ROS;
step S2: preprocessing the visual image;
step S3: inputting the preprocessed visual image into a visual odometer for calculation to obtain the preliminary real-time position of the unmanned platform;
step S4: updating the local map and optimizing the preliminary real-time position through the information of the local map;
step S5: and sending the optimized real-time position to the reinforcement learning module through the ROS.
4. An unmanned platform reinforcement learning autonomous navigation system based on visual input according to claim 1 or 2, wherein:
The reinforcement learning module consists of a state feature extraction module responsible for extracting state features and a deep reinforcement learning decision module responsible for path planning; when the reinforcement learning training needs to be restarted, the reinforcement learning module serves as a client of the ROS service to send a reset request to a service end of the visual perception module, and the reinforcement learning module and the visual perception module are reset at the same time.
5. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 4, wherein:
the state feature extraction module is used for receiving and processing original information from the environment simulation module and the visual perception module, wherein the original information comprises RGB image and depth image information given by the environment simulation module, speed information of the unmanned platform, real-time position information of the unmanned platform given by the visual perception module and time sequence information generated in the navigation training process; the state feature extraction module is used for respectively processing and carrying out feature aggregation on the information to obtain final unmanned platform state features and outputting the final unmanned platform state features to the deep reinforcement learning decision module;
the deep reinforcement learning decision module is used for receiving the unmanned platform state characteristics output by the state characteristic extraction module, outputting the optimal actions of the unmanned platform under the current state characteristics according to the received unmanned platform state characteristics, and executing the actions by the unmanned platform in the environment simulation module, wherein the deep reinforcement learning decision module comprises an action set design, a reward function design and an algorithm architecture design.
6. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 5, wherein:
the state feature extraction module in the reinforcement learning module consists of a visual information processing sub-module, a motion information processing sub-module and a time sequence feature extraction sub-module;
the visual information processing sub-module is used for receiving the 3-channel RGB image and the aligned depth image of the unmanned platform output by the environment simulation module; finally obtaining the visual characteristics of the unmanned platform after processing;
the motion information processing sub-module receives the real-time position of the unmanned platform calculated by the visual perception module and the speed information of the unmanned platform output by the environment simulation module; finally obtaining the motion characteristics of the unmanned platform after processing;
the time sequence feature extraction sub-module combines the action information and reward information of the previous period of reinforcement learning with the visual features and motion features of the unmanned platform to obtain the final unmanned platform state features.
7. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 6, wherein:
the method for processing the received original information by the state feature extraction module comprises the following steps:
step S1: the visual information processing sub-module receives the RGB image and the depth image transmitted from the environment simulation module;
step S2: preprocessing the RGB image in the order of range scaling, resizing and standardization;
step S3: preprocessing the depth image in the order of range scaling, resizing and standardization;
step S4: extracting 1×512 RGB features from the preprocessed RGB image by using a ResNet-50 network;
step S5: extracting 1×512 depth features from the preprocessed depth image by using a ResNet-50 network;
step S6: performing feature aggregation on the RGB features and the depth features to obtain 1×1024 features;
step S7: mapping the 1×1024 features to 1×512 dimensions through a fully connected layer to obtain the visual features of the unmanned platform;
step S9: the motion information processing sub-module receives the real-time position of the unmanned platform calculated by the visual perception module and the speed information of the unmanned platform output by the environment simulation module;
step S10: using the real-time position information of the unmanned platform and the navigation target point information to obtain the relative position of the unmanned platform with respect to the navigation target point in the NED coordinate system;
step S11: calculating the linear velocity of the unmanned platform in the unmanned platform body coordinate system and the relative position of the unmanned platform with respect to the navigation target point in the unmanned platform body coordinate system;
step S12: performing feature mapping on the linear velocity and the relative position information through a multi-layer perceptron, mapping them from 6 dimensions to 128 dimensions, to obtain the motion features of the unmanned platform;
step S13: the time sequence feature extraction sub-module receives the 1×512 visual features output by the visual information processing sub-module and the 1×128 motion features output by the motion information processing sub-module;
step S14: aggregating the visual features and the motion features to obtain a 1×640 vector;
step S15: mapping the 1×640 vector obtained in step S14 to 1×512 through a fully connected layer to obtain the motion state features of the unmanned platform in the current state;
step S16: inputting the current state features of the unmanned platform, the action information of the previous period and the reward information of the previous period into the LSTM network for extraction, finally obtaining the 1×512 unmanned platform state features.
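A condensed PyTorch sketch of the feature pipeline in steps S1-S16 above; the use of torchvision's ResNet-50 backbones, the single-channel depth input, the layer names and the activation choices are assumptions, since the claim only fixes the feature dimensions (1×512 visual, 1×128 motion, 1×640 aggregated, 1×512 final state).

```python
# Sketch under assumptions: torchvision ResNet-50 backbones with their classifier heads
# replaced by 512-d projections; the depth branch is assumed to take a 1-channel image.
import torch
import torch.nn as nn
import torchvision.models as models

class StateFeatureExtractor(nn.Module):
    def __init__(self, action_dim=8):
        super().__init__()
        # Steps S4/S5: one ResNet-50 per modality, final fc layer remapped to 512 features.
        self.rgb_net = models.resnet50(weights=None)
        self.rgb_net.fc = nn.Linear(self.rgb_net.fc.in_features, 512)
        self.depth_net = models.resnet50(weights=None)
        self.depth_net.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.depth_net.fc = nn.Linear(self.depth_net.fc.in_features, 512)
        # Step S7: map the aggregated 1024-d visual feature to 512 dimensions.
        self.visual_fc = nn.Linear(1024, 512)
        # Step S12: 6-d motion input (body-frame velocity + relative position) -> 128.
        self.motion_mlp = nn.Sequential(nn.Linear(6, 64), nn.ReLU(), nn.Linear(64, 128))
        # Step S15: 640-d aggregated feature -> 512-d current motion state feature.
        self.state_fc = nn.Linear(512 + 128, 512)
        # Step S16: LSTM over [state feature, previous action, previous reward].
        self.lstm = nn.LSTM(512 + action_dim + 1, 512, batch_first=True)

    def forward(self, rgb, depth, motion, prev_action, prev_reward, hidden=None):
        rgb_feat = self.rgb_net(rgb)                     # step S4: 1x512
        depth_feat = self.depth_net(depth)               # step S5: 1x512
        visual = torch.cat([rgb_feat, depth_feat], -1)   # step S6: 1x1024
        visual = self.visual_fc(visual)                  # step S7: 1x512
        motion_feat = self.motion_mlp(motion)            # step S12: 1x128
        fused = torch.cat([visual, motion_feat], -1)     # step S14: 1x640
        state = torch.relu(self.state_fc(fused))         # step S15: 1x512
        lstm_in = torch.cat([state, prev_action, prev_reward], -1).unsqueeze(1)
        out, hidden = self.lstm(lstm_in, hidden)         # step S16: 1x512 state feature
        return out.squeeze(1), hidden
```

Here rgb and depth would be the tensors preprocessed in steps S2-S3, and prev_action would be, for example, a one-hot encoding of the previous-period action; these encodings are likewise assumptions, as the claim does not specify them.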
8. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 5, wherein:
the deep reinforcement learning decision module makes a decision according to the final 1×512 state features and outputs the corresponding unmanned platform action;
the action set of the deep reinforcement learning decision module is a discrete set of eight actions: unidirectional motion forward, backward, left and right, oblique motion to the front-left, oblique motion to the front-right, counterclockwise rotation, and clockwise rotation;
The reward function of the deep reinforcement learning decision module, namely the reward function r in the reinforcement learning PPO controller algorithm, consists of four parts: an arrival reward r_f, a collision reward r_c, a step reward r_s and a distance reward r_d;
the algorithm architecture of the deep reinforcement learning decision module is designed as follows: the current state of the unmanned platform is input into the PPO network and the heuristic controller at the same time; the PPO network and the heuristic controller output a reinforcement learning action and a heuristic action respectively; the action adjustment network decides, according to the current state, whether the reinforcement learning action or the heuristic action is executed; the agent executes the decision; the actually executed action and the decision are stored in the experience pool; and the PPO network and the action adjustment network are then updated.
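Claim 8 names the four reward terms r_f, r_c, r_s and r_d but gives no formulas; the sketch below is only one plausible shaping, with the constants and the progress-based distance term chosen purely for illustration.

```python
# Illustrative only: the claim names r_f (arrival), r_c (collision), r_s (step) and
# r_d (distance) but does not give their formulas; all constants here are assumptions.
def reward(reached_goal, collided, prev_dist, curr_dist,
           r_arrive=10.0, r_collide=-10.0, r_step=-0.01, k_dist=1.0):
    r_f = r_arrive if reached_goal else 0.0        # arrival reward
    r_c = r_collide if collided else 0.0           # collision penalty
    r_s = r_step                                   # per-step cost to encourage short paths
    r_d = k_dist * (prev_dist - curr_dist)         # reward progress toward the target
    return r_f + r_c + r_s + r_d
```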
9. The unmanned platform reinforcement learning autonomous navigation system based on visual input of claim 8, wherein:
the algorithm steps of the deep reinforcement learning decision module are as follows:
step S1: initializing actions to adjust network parameters
Step S2: initializing PPO policy network parametersAnd a cost function network parameter phi 0
Step S3: initializing an experience playback pool R;
step S4: resetting the environment and unmanned platform states, and resetting the experience playback pool;
step S5: obtaining an initial shape of an unmanned platform State s 0 And target point position p d
Step S6: sampling PPO policy network output actions
Step S7: sampling heuristic controller output actions
Step S8: sampling action adjustment network output action a t ∈{a t 0 ,a t 1 };
Step S9: executing action a t Obtain the prize r t+1 And next state s t+1
Step S10: judging S t+1 -p d If it approaches 0, let d t+1 =1; if it does not approach 0, let d t+1 =0;
Step S11: will(s) t ,a t ,r t+1 ,s t+1 ,d t+1 ) Storing into R;
step S12: judging whether the current moment is T-1 or not, and executing a step S13 if the current moment is T-1; if not, executing step S6;
step S13: randomly sampling a sample of a miniband size from R;
step S14: calculating cumulative discount revenuesCalculating dominance function->
Step S15: updating the PPO network and updating the action adjustment network;
step S16: judging whether the current sampling number is M, if so, executing a step S17; if not, executing step S13;
step S17: judging whether the K-th screen is reached at present, if so, ending the algorithm, and if not, executing the step S4.
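A structural sketch of the loop in steps S1-S17 above; the objects ppo, heuristic, adjust and env are hypothetical placeholders for the patent's networks and simulator, the return in step S14 is simplified to a one-step bootstrapped estimate, and the actual PPO and action adjustment network update rules are not reproduced.

```python
# Structural sketch only: PPOAgent-style objects, the env interface and the helper
# attributes (sample_action, value, update, choose) are hypothetical placeholders.
import random
import numpy as np

def train(env, ppo, heuristic, adjust, K, T, M, batch_size=64, gamma=0.99):
    # S1-S3: network parameters are assumed to be initialized inside ppo and adjust.
    replay = []                                                # experience replay pool R
    for episode in range(K):                                   # S17: run K episodes in total
        replay.clear()                                         # S4: reset the replay pool
        state, p_d = env.reset()                               # S4/S5: initial state s_0, target p_d
        for t in range(T - 1):                                 # S12: roll out until t = T-1
            a_rl = ppo.sample_action(state)                    # S6: PPO policy action a_t^0
            a_heu = heuristic.action(state, p_d)               # S7: heuristic action a_t^1
            a_t = adjust.choose(state, a_rl, a_heu)            # S8: a_t in {a_t^0, a_t^1}
            next_state, r, _ = env.step(a_t)                   # S9: reward r_{t+1}, state s_{t+1}
            d = 1 if np.linalg.norm(next_state - p_d) < 1e-2 else 0   # S10: done flag
            replay.append((state, a_t, r, next_state, d))      # S11: store transition in R
            state = next_state
        for _ in range(M):                                     # S16: M sampling/update rounds
            batch = random.sample(replay, min(batch_size, len(replay)))   # S13: minibatch
            # S14 (simplified): one-step bootstrapped return and advantage A = G - V(s).
            returns = [r + gamma * (1 - d) * ppo.value(s2) for (_, _, r, s2, d) in batch]
            advantages = [g - ppo.value(s) for g, (s, *_rest) in zip(returns, batch)]
            ppo.update(batch, returns, advantages)             # S15: update the PPO networks
            adjust.update(batch, advantages)                   # S15: update the adjustment network
```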
10. An unmanned platform reinforcement learning autonomous navigation method based on visual input, characterized by comprising the following steps:
in the first step, the environment simulation module outputs the speed information, RGB image information and depth image information of the simulated unmanned platform to the reinforcement learning module;
in the second step, the environment simulation module outputs the visual images of the binocular camera to the visual perception module;
in the third step, the visual perception module receives the binocular camera images output by the environment simulation module, obtains the relative position of the unmanned platform in the world coordinate system from the received images, and outputs it to the reinforcement learning module;
in the fourth step, the reinforcement learning module receives the speed information, RGB image information and depth image information of the unmanned platform output by the environment simulation module and the relative position of the unmanned platform in the world coordinate system output by the visual perception module, and, according to this received information, outputs the action of the unmanned platform to the unmanned platform in the environment simulation module.
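For illustration, the four steps above can be read as one control cycle; the method names below (get_platform_observation, get_stereo_images, estimate_pose, act, apply_action) are hypothetical, since the claim only fixes which information flows between the modules.

```python
# Hypothetical interface names; only the information flow comes from the claim.
def navigation_cycle(env_sim, perception, rl_module):
    # First step: the simulator outputs velocity, RGB and depth information to the RL module.
    velocity, rgb, depth = env_sim.get_platform_observation()
    # Second and third steps: the simulator outputs binocular images and the perception
    # module estimates the relative position in the world coordinate system.
    left_img, right_img = env_sim.get_stereo_images()
    position = perception.estimate_pose(left_img, right_img)
    # Fourth step: the RL module maps all received information to an action.
    action = rl_module.act(rgb, depth, velocity, position)
    env_sim.apply_action(action)
```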
CN202310458355.4A 2023-04-26 2023-04-26 Unmanned platform reinforcement learning autonomous navigation system and method based on visual input Pending CN116734850A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310458355.4A CN116734850A (en) 2023-04-26 2023-04-26 Unmanned platform reinforcement learning autonomous navigation system and method based on visual input

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310458355.4A CN116734850A (en) 2023-04-26 2023-04-26 Unmanned platform reinforcement learning autonomous navigation system and method based on visual input

Publications (1)

Publication Number Publication Date
CN116734850A true CN116734850A (en) 2023-09-12

Family

ID=87915910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310458355.4A Pending CN116734850A (en) 2023-04-26 2023-04-26 Unmanned platform reinforcement learning autonomous navigation system and method based on visual input

Country Status (1)

Country Link
CN (1) CN116734850A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117215197A (en) * 2023-10-23 2023-12-12 Nankai University Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium
CN117215197B (en) * 2023-10-23 2024-03-29 Nankai University Four-rotor aircraft online track planning method, four-rotor aircraft online track planning system, electronic equipment and medium

Similar Documents

Publication Publication Date Title
Ruan et al. Mobile robot navigation based on deep reinforcement learning
CN113485380B (en) AGV path planning method and system based on reinforcement learning
US20230043931A1 (en) Multi-Task Multi-Sensor Fusion for Three-Dimensional Object Detection
CN111123963B (en) Unknown environment autonomous navigation system and method based on reinforcement learning
CN110007675B (en) Vehicle automatic driving decision-making system based on driving situation map and training set preparation method based on unmanned aerial vehicle
Tai et al. Towards cognitive exploration through deep reinforcement learning for mobile robots
CN111923928A (en) Decision making method and system for automatic vehicle
CN112629542B (en) Map-free robot path navigation method and system based on DDPG and LSTM
CN112232490A (en) Deep simulation reinforcement learning driving strategy training method based on vision
CN111795700A (en) Unmanned vehicle reinforcement learning training environment construction method and training system thereof
Espinoza et al. Deep interactive motion prediction and planning: Playing games with motion prediction models
Li et al. Learning view and target invariant visual servoing for navigation
CN116734850A (en) Unmanned platform reinforcement learning autonomous navigation system and method based on visual input
CN115303297A (en) Method and device for controlling end-to-end automatic driving under urban market scene based on attention mechanism and graph model reinforcement learning
Chen et al. Automatic overtaking on two-way roads with vehicle interactions based on proximal policy optimization
Guo et al. A deep reinforcement learning based approach for AGVs path planning
Liu et al. A hierarchical reinforcement learning algorithm based on attention mechanism for uav autonomous navigation
CN113961013A (en) Unmanned aerial vehicle path planning method based on RGB-D SLAM
US20210398014A1 (en) Reinforcement learning based control of imitative policies for autonomous driving
CN117475279A (en) Reinforced learning navigation method based on target drive
CN112857373B (en) Energy-saving unmanned vehicle path navigation method capable of minimizing useless actions
CN115373415A (en) Unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN115457075A (en) Mobile robot target following method based on SAC-PID
Zhao et al. 3D indoor map building with monte carlo localization in 2D map
Lu et al. Autonomous mobile robot navigation in uncertain dynamic environments based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination