CN114415657A - Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot - Google Patents

Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot

Info

Publication number
CN114415657A
Authority
CN
China
Prior art keywords
wall
reinforcement learning
deep reinforcement
cleaning robot
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111503328.1A
Other languages
Chinese (zh)
Inventor
王毓玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anker Innovations Co Ltd
Original Assignee
Anker Innovations Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anker Innovations Co Ltd filed Critical Anker Innovations Co Ltd
Priority to CN202111503328.1A priority Critical patent/CN114415657A/en
Publication of CN114415657A publication Critical patent/CN114415657A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0257Control of position or course in two dimensions specially adapted to land vehicles using a radar

Abstract

A cleaning robot wall-following method based on deep reinforcement learning, and a cleaning robot, are provided. The method comprises the following steps: when a cleaning robot enters a wall-following scene, acquiring measurement data of a sensor of the cleaning robot; inputting the measurement data into a trained deep reinforcement learning network, the deep reinforcement learning network outputting a wall-following action; and controlling the cleaning robot to complete the movement and cleaning operations of the wall-following scene based on the wall-following action. The deep reinforcement learning network is trained with an asynchronous advantage actor-critic (A3C) algorithm in which a plurality of agents are trained simultaneously. According to this scheme, once the trained deep reinforcement learning network receives the sensor data of the cleaning robot in a wall-following scene, it can output the optimal wall-following action, so that the cleaning robot can complete the cleaning of the wall-following scene well.

Description

Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot
Technical Field
The application relates to the technical field of cleaning robots, in particular to a cleaning robot wall following method based on deep reinforcement learning and a cleaning robot.
Background
Cleaning robots (also called sweepers) have gradually entered people's homes in recent years and have taken over some of the daily cleaning chores. A sweeper can clean household areas that people rarely clean themselves but that the sweeper can reach. User experience mainly depends on two aspects: cleaning efficiency and cleaning coverage. Compared with conventional sweepers without path planning and navigation, the currently popular sweepers based on laser (Lidar) sensors and time-of-flight (ToF) sensors show a clear improvement on both points during cleaning. In the cleaning process of a laser sweeper, the quality of the robot's wall-following technique has an important influence on the cleaning efficiency and the coverage area.
The existing wall-following method of laser sweepers is mainly based on Proportional-Integral-Derivative (PID) control. The distance and angle of the object along the robot's edge (hereinafter uniformly referred to as the wall) relative to the machine body are measured from a ToF distance sensor and 2D Lidar (hereinafter referred to as Lidar) distance data, and PID control is applied to a differential-drive motion model together with the robot's own linear and angular velocity. Two major problems arise in the wall-following process. First, a conventional PID wall-following algorithm takes both Lidar and ToF distance sensors as input, and the two must be fused to control the robot, which makes the algorithm relatively complex; moreover, the mounting heights of the Lidar and the ToF distance sensor are generally inconsistent, so the Lidar does not necessarily observe the same distance information when the ToF sensor has valid wall data. Second, a single set of PID control parameters is difficult to adapt to complex household environments: sometimes a straight wall suddenly turns through 90 degrees, sometimes through 45 degrees, and cylindrical wall-following scenes with different curvatures may also exist. It is difficult to integrate all possible scenes and their corresponding control strategies into one PID control algorithm; in particular, real scenes can combine in millions of ways, different combinations are hard to enumerate, and finding a specific rule for every scene to produce a specific wall-following control output is difficult to realize.
Disclosure of Invention
The present application is proposed to solve the above problems. According to an aspect of the application, a cleaning robot wall-following method based on deep reinforcement learning is provided. The method comprises the following steps: when a cleaning robot enters a wall-following scene, acquiring measurement data of a sensor of the cleaning robot; inputting the measurement data into a trained deep reinforcement learning network, the deep reinforcement learning network outputting a wall-following action; and controlling the cleaning robot to complete the movement and cleaning operations of the wall-following scene based on the wall-following action. The deep reinforcement learning network is trained with an asynchronous advantage actor-critic (A3C) algorithm in which a plurality of agents are trained simultaneously.
In one embodiment of the present application, the training of the deep reinforcement learning network includes: constructing a plurality of simulation environments, wherein each simulation environment includes a wall surface and an obstacle, and the distance between the obstacle and the wall surface is not greater than the lateral dimension of the cleaning robot; for each simulation environment, having one agent perform wall-following actions in the simulation environment based on a copy of the deep reinforcement learning network so as to train the copy and obtain local copy parameters; and updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent, so as to obtain the trained deep reinforcement learning network.
In an embodiment of the present application, training the copy of the deep reinforcement learning network includes: in each training round, acquiring simulated measurement data of the agent, inputting the simulated measurement data into the copy, the copy outputting a wall-following action, and the agent executing the wall-following action output by the copy; in each training round, when the wall-following motion ends, updating a cost function based on the relevant parameters of the wall-following motion, and updating the network parameters of the copy based on the cost function; and when a preset number of training rounds or a preset training time is reached, using the network parameters of the copy as the local copy parameters for updating the global network parameters.
In one embodiment of the present application, the relevant parameters of the wall-following motion include at least one of: the time taken by the agent from the start of the wall-following motion to its end; whether the agent gets closer to or farther from the end point during the wall-following motion; whether the agent crosses a preset virtual boundary line during the wall-following motion; and the area enclosed between the trajectory of the agent during the wall-following motion and the edge of the obstacle in the simulation environment.
In one embodiment of the present application, the distance from the predetermined virtual boundary line to the wall surface is equal to 1.5 times the lateral dimension of the cleaning robot.
In an embodiment of the present application, updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent includes: periodically updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent; or, after a preset number of training rounds is reached, updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent.
In one embodiment of the present application, the sensors of the cleaning robot include a radar sensor and a time-of-flight sensor, and inputting the measurement data into the trained deep reinforcement learning network includes: splicing the measurement data of the time-of-flight sensor, in time order, into data with the same dimensionality as the measurement data of the radar sensor; and normalizing the spliced data and the measurement data of the radar sensor and inputting the normalized data into the trained deep reinforcement learning network.
In one embodiment of the present application, the deep reinforcement learning network includes two convolutional layers and one fully connected layer.
According to another aspect of the present application, there is provided a cleaning robot including a memory, a processor, a sensor, a motion module, and a cleaning assembly, wherein: the sensor is used to collect environmental data for an area to be cleaned; the memory stores computer readable instructions which, when executed by the processor, cause the processor to perform the deep reinforcement learning based cleaning robot wall-following method of any one of claims 1-8 so as to output a wall-following action based on the data collected by the sensor; and the motion module and the cleaning assembly are used to complete the movement and cleaning operations of the wall-following scene based on the wall-following action output by the processor.
In one embodiment of the present application, the sensors include a radar sensor and a time-of-flight sensor.
According to the cleaning robot wall-following method based on deep reinforcement learning and the cleaning robot of the present application, the deep reinforcement learning network is trained with an asynchronous advantage actor-critic algorithm in which a plurality of agents are trained simultaneously; once the trained deep reinforcement learning network receives sensor data of the cleaning robot in a wall-following scene, it can output the optimal wall-following action, so that the cleaning robot can complete the cleaning of the wall-following scene well. The robot learns a wall-following algorithm suited to the training scenes from the results of learning in the simulation environment, without designing a corresponding PID wall-following algorithm for each kind of environment. Moreover, the Lidar and ToF data are input to the learned network simultaneously and the optimal motion control output for the current state is obtained, without fusing the Lidar and ToF data or switching the wall-following strategy according to specific conditions.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 shows a schematic flow diagram of a cleaning robot along-the-wall method based on deep reinforcement learning according to an embodiment of the present application.
Fig. 2 shows a schematic diagram of a deep reinforcement learning network used in a cleaning robot along-the-wall method based on deep reinforcement learning according to an embodiment of the present application.
Fig. 3 shows a schematic diagram of training network parameters in a cleaning robot along-wall method based on deep reinforcement learning according to an embodiment of the present application.
FIG. 4 illustrates an example diagram of a simulation environment employed in training the deep reinforcement learning network in the deep reinforcement learning based cleaning robot wall-following method according to an embodiment of the present application.
Fig. 5 shows a schematic block diagram of a cleaning robot according to an embodiment of the application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, exemplary embodiments according to the present application will be described in detail below with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the application described in the application without inventive step, shall fall within the scope of protection of the application.
Fig. 1 shows a schematic flow diagram of a cleaning robot along-the-wall method 100 based on deep reinforcement learning according to an embodiment of the present application. As shown in fig. 1, a deep reinforcement learning based cleaning robot along-the-wall method 100 may include the steps of:
In step S110, when the cleaning robot enters a wall-following scene, measurement data of a sensor of the cleaning robot is acquired.
In step S120, the measurement data is input into the trained deep reinforcement learning network, and the deep reinforcement learning network outputs a wall-following action, wherein the training of the deep reinforcement learning network is implemented with an asynchronous advantage actor-critic algorithm in which a plurality of agents are trained simultaneously.
In step S130, the cleaning robot is controlled to complete the movement and cleaning operations of the wall-following scene based on the wall-following action.
In the embodiment of the application, a Deep Reinforcement Learning (DRL) network is trained with an Asynchronous Advantage Actor-Critic (A3C) algorithm in which a plurality of agents are trained simultaneously; as long as the trained deep reinforcement learning network receives sensor data of the cleaning robot in a wall-following scene, it can output the optimal wall-following action, so that the cleaning robot can better complete the cleaning of the wall-following scene. In recent years, deep reinforcement learning has progressed from controlling single agents in simple games, to AlphaGo and AlphaZero defeating the world's top Go players, and further to multi-agent cooperation in e-sports games and impressive robot control results in the real world. DRL therefore has outstanding advantages in certain scenarios: it can learn the optimal control output over a complex state space from designed reward-and-punishment rules and many rounds of training. In the conventional Actor-Critic algorithm, the Actor selects the optimal action by updating according to the TD error that the Critic computes from the value of the current state and the value of the next state; the A3C algorithm proposed by DeepMind can use multiple cores to train several agents in parallel, with the learning of the multiple agents sharing and updating the network parameters, which avoids the non-convergence problem of Actor-Critic and improves the feasibility of applying the algorithm to real, complex scenes. Therefore, based on the above advantages of the A3C algorithm, a deep reinforcement learning network is constructed for various wall-following scenes and trained on them, with each wall-following action given a certain feedback according to the reward-and-punishment rules; after a certain number of training rounds (or a certain amount of training time), a trained deep reinforcement learning network is obtained, which has autonomously learned the wall-following method during training.
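For reference, the TD error mentioned above, which the Critic supplies and the Actor uses for its update, can be written in its standard textbook form (the notation is supplied for clarity and is not reproduced from this application):

    \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),

where \gamma is the discount factor, r_t is the reward at time t, and V(\cdot) is the value function.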
In an embodiment of the present application, the constructed deep reinforcement learning network may include two convolutional layers and one fully connected layer. Described below in conjunction with fig. 2.
Fig. 2 shows a schematic diagram of a deep reinforcement learning network used in a cleaning robot wall-following method based on deep reinforcement learning according to an embodiment of the present application. As shown in fig. 2, the deep reinforcement learning network includes two convolutional layers (both shown as conv1D in the figure) and one fully connected layer (shown as Dense in the figure). The input to the network is the sensor data of the cleaning robot (in the figure, the cleaning robot includes a radar sensor and a time-of-flight sensor as an example, so the input to the network is the respective measurement data of the radar sensor and the time-of-flight sensor). The policy (Policy) function and the value (Value) function in the network determine and output the robot's optimal wall-following action.
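As an illustration only, the following sketch shows one way such a network could be written (here in Keras); the layer sizes, kernel widths, input length, and number of discrete actions are assumptions for illustration, not values taken from this application.

```python
# A minimal sketch of the described architecture: two 1-D convolution layers followed by
# one fully connected layer, with separate policy and value heads (actor-critic style).
# The input length (e.g. 360 Lidar beams plus ToF data spliced to the same dimension)
# and the number of discrete wall-following actions are assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def build_a3c_network(input_len=720, n_actions=5):
    inputs = layers.Input(shape=(input_len, 1))          # normalized Lidar + ToF vector
    x = layers.Conv1D(32, kernel_size=5, strides=2, activation="relu")(inputs)
    x = layers.Conv1D(32, kernel_size=3, strides=2, activation="relu")(x)
    x = layers.Flatten()(x)
    x = layers.Dense(128, activation="relu")(x)          # the fully connected layer
    policy = layers.Dense(n_actions, activation="softmax", name="policy")(x)  # pi(a|s)
    value = layers.Dense(1, name="value")(x)             # V(s)
    return tf.keras.Model(inputs, [policy, value])
```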
Because the radar sensor and the time-of-flight sensor collect data of different dimensionality, and a typical time-of-flight sensor has a higher acquisition frequency, in the embodiment of the application the measurement data of the time-of-flight sensor can be spliced, in time order, into data with the same dimensionality as the measurement data of the radar sensor; the spliced data and the measurement data of the radar sensor are then normalized and input into the deep reinforcement learning network. The policy function and the value function continuously learn and update the optimal action output corresponding to the sensor input, in combination with the changes of the environment and the score changes given by the reward-and-punishment rules.
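A minimal sketch of this preprocessing step is given below; the Lidar length, the repeat-and-trim splicing rule, and the normalization ranges are assumptions for illustration rather than the application's exact procedure.

```python
import numpy as np

def preprocess(lidar_scan, tof_readings):
    """Splice the higher-rate ToF readings, in time order, into a vector with the same
    dimensionality as the Lidar scan, then normalize both before feeding the network.
    The maximum ranges used for normalization and the repeat/trim rule are assumptions."""
    lidar = np.asarray(lidar_scan, dtype=np.float32)
    tof = np.asarray(tof_readings, dtype=np.float32)
    # Repeat or trim the ToF sequence so it matches the Lidar dimensionality.
    reps = int(np.ceil(len(lidar) / max(len(tof), 1)))
    tof_spliced = np.tile(tof, reps)[: len(lidar)]
    # Normalize each sensor by an assumed maximum measurable range.
    lidar_norm = np.clip(lidar / 8.0, 0.0, 1.0)      # e.g. 8 m Lidar range (assumed)
    tof_norm = np.clip(tof_spliced / 2.0, 0.0, 1.0)  # e.g. 2 m ToF range (assumed)
    return np.concatenate([lidar_norm, tof_norm])    # one vector fed to the network input
```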
In an embodiment of the present application, the training of the deep reinforcement learning network may include: constructing a plurality of simulation environments, wherein each simulation environment includes a wall surface and an obstacle, and the distance between the obstacle and the wall surface is not greater than the lateral dimension of the cleaning robot; for each simulation environment, having one agent perform wall-following actions in the simulation environment based on a copy of the deep reinforcement learning network so as to train the copy and obtain local copy parameters; and updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent, so as to obtain the trained deep reinforcement learning network. This is described below in conjunction with fig. 3 and 4.
Fig. 3 shows a schematic diagram of training the network parameters in a cleaning robot wall-following method based on deep reinforcement learning according to an embodiment of the present application. As shown in fig. 3, the deep reinforcement learning network to be trained itself serves as the global network, and the parameters of the global network are called global network parameters. The global network can be copied into a plurality of copy networks, namely deep reinforcement learning network copies (which may be referred to as network policy copies, network copies, or simply copies); each deep reinforcement learning network copy is trained by one agent in one environment (shown in fig. 3 as agent 1 to agent n and environment 1 to environment n), and the network parameters obtained when each copy is trained are referred to as local copy parameters. The global network parameters of the deep reinforcement learning network can be updated based on the local copy parameters obtained from the training of each agent, so that the trained deep reinforcement learning network is obtained.
In the embodiment of the present application, the environment corresponding to each agent may be a real environment or a simulation environment. In one example, a simulated home environment can be prepared in the simulator: a complex home environment can be simplified into a single wall that can be trained on repeatedly, because during cleaning the wall-following algorithm only takes effect when the robot is close to a wall or there is valid ToF data, so one wall can stand in for a complex home structure. The agent obtains wall-following actions according to the policy of its deep reinforcement learning copy, and moving from one end (the start point) of the wall to the other end (the end point) counts as one training round; the agent learns repeatedly in the simulation environment to update the network parameters. This is described below in conjunction with fig. 4.
FIG. 4 is a diagram illustrating an example of the simulation environment employed in training the deep reinforcement learning network in the deep reinforcement learning based cleaning robot wall-following method according to an embodiment of the present application. As shown in fig. 4, an obstacle is generated between the start point and the end point and placed close to the wall surface; the obstacle is no more than one machine-body width (the lateral dimension) away from the wall surface, so as to avoid generating an invalid obstacle.
Based on such a simulation environment, training a copy of the deep reinforcement learning network may include: in each training round, acquiring simulated measurement data of the agent (Gaussian noise may be added to the simulated measurement data to imitate measurement errors), inputting the simulated measurement data into the copy, the copy outputting a wall-following action, and the agent executing the wall-following action output by the copy; in each training round, when the wall-following motion ends, updating a cost function based on the relevant parameters of the wall-following motion and updating the network parameters of the copy based on the cost function; and when a preset number of training rounds or a preset training time is reached, using the network parameters of the copy as the local copy parameters for updating the global network parameters. Specifically, after receiving the input simulated measurement data, the copy randomly generates a set of linear and angular velocities in the simulation according to the simulated agent's motion model and sends them to the agent; after the agent executes the wall-following action, the environment and the state of the agent are updated. In the subsequent training process, the value-evaluation policy and the action-output policy are continuously updated, and the learned experience is fed into the global network with a certain discount function.
Based on this, updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent may include: periodically updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent; or, after a preset number of training rounds is reached, updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent.
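The following sketch illustrates how one such agent (worker) could be organized: it trains its local copy round by round in a simulation environment and pushes its parameters to the global network at a fixed interval. The environment interface (reset/step), the optimizer, the noise level, and all hyper-parameters are assumptions for illustration, not the application's exact training code.

```python
import numpy as np
import tensorflow as tf

GAMMA = 0.99        # discount used for the returns ("discount function" in the text); value assumed
SYNC_EVERY = 20     # push the local copy's parameters to the global network every N rounds (assumed)

def run_worker(global_net, build_net, env, rounds=1000):
    """One worker: trains a local copy in its own simulation environment and
    periodically shares its parameters with the global network."""
    local_net = build_net()
    local_net.set_weights(global_net.get_weights())
    opt = tf.keras.optimizers.RMSprop(learning_rate=1e-4)
    for episode in range(rounds):
        obs, done = env.reset(), False         # one round: start at one end of the wall
        states, actions, rewards = [], [], []
        while not done:
            noisy = obs + np.random.normal(0.0, 0.01, size=obs.shape)   # simulated measurement error
            probs = local_net(noisy[None, :, None])[0].numpy()[0].astype(np.float64)
            a = int(np.random.choice(len(probs), p=probs / probs.sum()))
            obs, r, done = env.step(a)         # reward follows the time/distance/virtual-line/coverage rules
            states.append(noisy); actions.append(a); rewards.append(r)
        # Discounted returns over the finished round (the copy is updated once the motion ends).
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + GAMMA * g
            returns.insert(0, g)
        s = np.stack(states)[..., None].astype(np.float32)
        ret = np.asarray(returns, dtype=np.float32)
        acts = np.asarray(actions, dtype=np.int32)
        with tf.GradientTape() as tape:
            policy, value = local_net(s)
            adv = ret - tf.squeeze(value, axis=1)
            logp = tf.math.log(tf.gather(policy, acts, batch_dims=1) + 1e-8)
            loss = -tf.reduce_mean(logp * tf.stop_gradient(adv)) + 0.5 * tf.reduce_mean(tf.square(adv))
        grads = tape.gradient(loss, local_net.trainable_variables)
        opt.apply_gradients(zip(grads, local_net.trainable_variables))
        if (episode + 1) % SYNC_EVERY == 0:    # share the learned local parameters with the global network
            global_net.set_weights(local_net.get_weights())
            local_net.set_weights(global_net.get_weights())
```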
In an embodiment of the present application, the aforementioned relevant parameters of the wall-following motion may include at least one of the following: the time taken by the agent from the start of the wall-following motion to its end; whether the distance between the agent and the end point becomes shorter or longer during the wall-following motion; whether the agent crosses a preset virtual boundary line during the wall-following motion (for example, the preset virtual boundary line is roughly parallel to the wall surface, and its distance to the wall surface is equal to 1.5 times the lateral dimension of the cleaning robot, as shown in fig. 4); and the area enclosed between the trajectory of the agent during the wall-following motion and the edge of the obstacle in the simulation environment.
In this embodiment, time, distance, the virtual line, and coverage are exemplary key factors in determining the cost. For the time factor, a small negative feedback accumulates as time passes from the moment the agent starts following the wall. For the distance factor, the closer the agent gets to the end point, the more positive feedback it receives; conversely it receives negative feedback. For the virtual-line factor, the agent receives a large negative feedback when it crosses the virtual line in the simulation environment and the round ends (that is, in each training round the agent is punished for moving away from the wall beyond the virtual line, the single training round ends, and the network parameters are updated). For the coverage factor, a corresponding amount of negative feedback is obtained based on the area created between each segment of the motion trajectory and the edge of the obstacle.
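A hedged sketch of such reward-and-punishment rules is shown below; it keeps the 1.5-times-body-width virtual boundary from the text, while the individual coefficients and the 0.33 m body width are assumptions for illustration.

```python
def step_reward(dt, d_goal_prev, d_goal_now, dist_to_wall, swept_area_gap,
                robot_width=0.33):
    """One step of the reward described above: time, distance to the end point,
    the virtual boundary line, and the uncovered area factor. Coefficients are assumed."""
    reward = -0.01 * dt                          # small negative feedback as time passes
    reward += 1.0 * (d_goal_prev - d_goal_now)   # positive when moving toward the end point, negative otherwise
    reward += -0.5 * swept_area_gap              # penalize area left between the trajectory and the obstacle edge
    done = False
    if dist_to_wall > 1.5 * robot_width:         # crossed the preset virtual boundary line
        reward -= 10.0                           # large penalty; this training round ends
        done = True
    return reward, done
```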
In other embodiments, other parameters may also be used as key factors in determining the cost. For example, the uncovered area between the agent's trajectory and the wall surface may be used in the cost function of a training round, and so on.
In general, training of the deep reinforcement learning network can be achieved by the A3C algorithm of DRL, in which multiple agents are trained simultaneously. The advantage of the A3C algorithm is that multiple agents are used to train different local copy parameters at the same time while one set of global network parameters is regularly updated and shared, so each agent simultaneously makes use of the sample parameters learned by the other agents, which speeds up training. Each agent has a network copy that predicts the value function V(s) (how good a state s is) and the policy π(s) (a probability distribution over actions), as shown in fig. 3; the Critic used for value evaluation and the Actor used for policy output are continuously updated, and the learned experience is fed into the global network with a certain discount function.
A training sample can be represented as e = (s, a, r, s′, d), where e represents an interaction between the agent and the environment, s represents the environment state, a represents the action of the robot, r represents the reward-and-punishment feedback for the robot taking action a in state s, s′ represents the new environment state after action a is performed, and d represents whether the new state is a terminal state (for example, the training round terminates after the agent crosses the preset virtual boundary line). The cost function Q(s, a) is updated according to the cost at the current time plus the cost of the environment at the next time when the action with the maximum reward is taken; the key factors determining the cost, such as time, distance, the virtual line, and coverage, are as described above.
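Written out, the update described here corresponds to the standard Bellman-style rule (the notation is supplied for clarity and is not reproduced from this application; \gamma is the discount function mentioned in the text):

    Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a').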
After training is completed and the policy network (that is, the trained deep reinforcement learning network) is obtained, when the sweeper starts to follow a wall, the data collected by the ToF and Lidar sensors are formed into a vector and input into the network, and the network outputs the current optimal action to the sweeper. This process is repeated continuously until the wall-following module exits (the start and end of wall-following can be triggered by a collision sensor, dividing the path into several wall-following segments, or the sweeping logic of the robot may detect the corresponding edge state and switch); the wall-following module then ends and control is handed over to the other modules of the sweeper.
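For illustration, a minimal version of this deployed loop might look as follows; the robot interface names (read_lidar, read_tof, execute, wall_following_active) and the preprocessing hook are hypothetical, not part of this application.

```python
import numpy as np

def follow_wall(policy_net, robot, preprocess_fn):
    """Run the trained policy network while the wall-following module is active.
    robot.* methods and preprocess_fn (e.g. the splicing-and-normalizing sketch above)
    are hypothetical names used only for this sketch."""
    while robot.wall_following_active():                  # exits e.g. on a bumper trigger or sweep-logic switch
        obs = preprocess_fn(robot.read_lidar(), robot.read_tof())   # ToF + Lidar formed into one vector
        policy = policy_net(obs[None, :, None])[0]
        action = int(np.argmax(policy.numpy()[0]))        # current optimal wall-following action
        robot.execute(action)                             # maps to a linear/angular velocity command
```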
Based on the above description, the cleaning robot wall-following method based on deep reinforcement learning according to the embodiment of the application trains the deep reinforcement learning network with an asynchronous advantage actor-critic algorithm in which a plurality of agents are trained simultaneously; once the trained deep reinforcement learning network receives sensor data of the cleaning robot in a wall-following scene, it can output the optimal wall-following action, so that the cleaning robot can better complete the cleaning of the wall-following scene. The robot learns a wall-following algorithm suited to the training scenes from the results of learning in the simulation environment, without designing a corresponding PID wall-following algorithm for each kind of environment. Moreover, the Lidar and ToF data are input to the learned network simultaneously and the optimal motion control output for the current state is obtained, without fusing the Lidar and ToF data or switching the wall-following strategy according to specific conditions.
The wall-following method of the cleaning robot based on deep reinforcement learning according to the embodiment of the application has been illustrated above by way of example. A cleaning robot provided according to another aspect of the present application is described below with reference to fig. 5. Fig. 5 shows a schematic block diagram of a cleaning robot 500 according to an embodiment of the present application. As shown in fig. 5, the cleaning robot 500 includes a memory 510, a processor 520, a sensor 530, a motion module 540, and a cleaning assembly 550, wherein: the sensor 530 is used to collect environmental data for the area to be cleaned; the memory 510 stores computer readable instructions which, when executed by the processor 520, cause the processor 520 to perform the foregoing deep reinforcement learning based cleaning robot wall-following method so as to output a wall-following action based on the data collected by the sensor 530; and the motion module 540 and the cleaning assembly 550 are used to complete the movement and cleaning operations of the wall-following scene based on the wall-following action output by the processor 520. In embodiments of the present application, the sensor 530 may include a radar sensor and a time-of-flight sensor. Those skilled in the art can understand the structure of the cleaning robot 500 and its specific operation with reference to the foregoing description; for brevity, the detailed description is omitted here.
Further, according to an embodiment of the present application, there is also provided a storage medium having stored thereon program instructions for executing the steps of the deep reinforcement learning based cleaning robot wall following method according to the embodiment of the present application when the program instructions are executed by a computer or a processor. The storage medium may include, for example, a memory card of a smart phone, a storage component of a tablet computer, a hard disk of a personal computer, a Read Only Memory (ROM), an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), a USB memory, or any combination of the above storage media.
Furthermore, according to an embodiment of the present application, there is also provided a computer program, which when executed by a computer or a processor, is configured to perform the corresponding steps of the deep reinforcement learning based cleaning robot wall following method according to the embodiment of the present application.
Based on the above description, according to the cleaning robot wall-following method based on deep reinforcement learning and the cleaning robot of the embodiments of the present application, the deep reinforcement learning network is trained with an asynchronous advantage actor-critic algorithm in which a plurality of agents are trained simultaneously; once the trained deep reinforcement learning network receives sensor data of the cleaning robot in a wall-following scene, it can output the optimal wall-following action, so that the cleaning robot can better complete the cleaning of the wall-following scene. The robot learns a wall-following algorithm suited to the training scenes from the results of learning in the simulation environment, without designing a corresponding PID wall-following algorithm for each kind of environment. Moreover, the Lidar and ToF data are input to the learned network simultaneously and the optimal motion control output for the current state is obtained, without fusing the Lidar and ToF data or switching the wall-following strategy according to specific conditions.
Although the example embodiments have been described herein with reference to the accompanying drawings, it is to be understood that the above-described example embodiments are merely illustrative and are not intended to limit the scope of the present application thereto. Various changes and modifications may be effected therein by one of ordinary skill in the pertinent art without departing from the scope or spirit of the present application. All such changes and modifications are intended to be included within the scope of the present application as claimed in the appended claims.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the units is only one logical functional division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another device, or some features may be omitted, or not executed.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the description of exemplary embodiments of the present application, various features of the present application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the application and aiding in the understanding of one or more of the various inventive aspects. However, the method of the present application should not be construed to reflect the intent: this application is intended to cover such departures from the present disclosure as come within known or customary practice in the art to which this invention pertains. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
It will be understood by those skilled in the art that all of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where such features are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some of the modules according to embodiments of the present application. The present application may also be embodied as apparatus programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.
The above description is only for the specific embodiments of the present application or the description thereof, and the protection scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope disclosed in the present application, and shall be covered by the protection scope of the present application. The protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A cleaning robot wall following method based on deep reinforcement learning, which is characterized by comprising the following steps:
when a cleaning robot enters a wall-following scene, acquiring measurement data of a sensor of the cleaning robot;
inputting the measurement data into a trained deep reinforcement learning network, and outputting a wall-following action by the deep reinforcement learning network;
controlling the cleaning robot to complete the movement and cleaning operations of the wall-following scene based on the wall-following action;
wherein the training of the deep reinforcement learning network is implemented with an asynchronous advantage actor-critic algorithm in which a plurality of agents are trained simultaneously, a plurality of simulation environments are built in the training process, each simulation environment comprises a wall surface and an obstacle, and the distance between the obstacle and the wall surface is not greater than the lateral dimension of the cleaning robot.
2. The method of claim 1, wherein the training of the deep reinforcement learning network comprises:
for each simulation environment, performing wall-following actions by one agent in the simulation environment based on the copy of the deep reinforcement learning network to train the copy of the deep reinforcement learning network so as to obtain local copy parameters;
and updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained by training of each agent so as to obtain the trained deep reinforcement learning network.
3. The method of claim 2, wherein training the replica of the deep reinforcement learning network comprises:
in each training round, acquiring simulated measurement data of the agent, inputting the simulated measurement data into the copy, the copy outputting a wall-following action, and the agent executing the wall-following action output by the copy;
in each training round, when the wall-following motion ends, updating a cost function based on relevant parameters of the wall-following motion, and updating the network parameters of the copy based on the cost function;
and when a preset number of training rounds or a preset training time is reached, using the network parameters of the copy as the local copy parameters for updating the global network parameters.
4. The method of claim 3, wherein the relevant parameters of the wall-following motion comprise at least one of:
the time taken by the agent from the start of the wall-following motion to its end;
whether the agent gets closer to or farther from the end point during the wall-following motion;
whether the agent crosses a preset virtual boundary line during the wall-following motion;
an area enclosed between the trajectory of the agent during the wall-following motion and the edge of the obstacle in the simulation environment.
5. The method of claim 4, wherein the distance from the preset virtual boundary line to the wall surface is equal to 1.5 times the lateral dimension of the cleaning robot.
6. The method of claim 2, wherein updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent comprises:
periodically updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent; or
after a preset number of training rounds is reached, updating the global network parameters of the deep reinforcement learning network based on the local copy parameters obtained from the training of each agent.
7. The method of any one of claims 1-6, wherein the sensors of the cleaning robot include a radar sensor and a time-of-flight sensor, and the inputting the measurement data into a trained deep reinforcement learning network comprises:
splicing the measurement data of the time-of-flight sensor, in time order, into data with the same dimensionality as the measurement data of the radar sensor;
and normalizing the spliced data and the measurement data of the radar sensor and inputting the normalized data into the trained deep reinforcement learning network.
8. The method of claim 7, wherein the deep reinforcement learning network comprises two convolutional layers and one fully connected layer.
9. A cleaning robot comprising a memory, a processor, a sensor, a motion module, and a cleaning assembly, wherein:
the sensor is used for collecting environmental data aiming at an area to be cleaned;
the memory has stored thereon computer readable instructions executed by the processor, which when executed by the processor, cause the processor to perform the deep reinforcement learning based cleaning robot wall following method of any one of claims 1-8 to output a wall following action based on data collected by the sensor;
the motion module and the cleaning assembly are used for completing the movement and cleaning operations of the wall-following scene based on the wall-following action output by the processor.
10. The cleaning robot of claim 9, wherein the sensor comprises a radar sensor and a time-of-flight sensor.
CN202111503328.1A 2021-12-09 2021-12-09 Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot Pending CN114415657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111503328.1A CN114415657A (en) 2021-12-09 2021-12-09 Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111503328.1A CN114415657A (en) 2021-12-09 2021-12-09 Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot

Publications (1)

Publication Number Publication Date
CN114415657A (en) 2022-04-29

Family

ID=81266209

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111503328.1A Pending CN114415657A (en) 2021-12-09 2021-12-09 Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot

Country Status (1)

Country Link
CN (1) CN114415657A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107092254A (en) * 2017-04-27 2017-08-25 北京航空航天大学 A kind of design method for the Household floor-sweeping machine device people for strengthening study based on depth
CN110063694A (en) * 2019-04-28 2019-07-30 彭春生 A kind of binocular sweeping robot and working method
US20190370691A1 (en) * 2019-07-12 2019-12-05 Lg Electronics Inc. Artificial intelligence robot for determining cleaning route using sensor data and method for the same
CN111358387A (en) * 2020-04-20 2020-07-03 成都科梦极信息科技有限公司 Building outer wall cleaning system and cleaning method
CN111487864A (en) * 2020-05-14 2020-08-04 山东师范大学 Robot path navigation method and system based on deep reinforcement learning
WO2021098536A1 (en) * 2019-11-18 2021-05-27 深圳市杉川机器人有限公司 Cleaning method and system employing scene recognition, sweeping device, and storage medium
CN113749562A (en) * 2021-08-13 2021-12-07 珠海格力电器股份有限公司 Sweeping robot and control method, device, equipment and storage medium thereof

Similar Documents

Publication Publication Date Title
Nehmzow Mobile robotics: a practical introduction
CN110786783B (en) Cleaning method of cleaning robot and cleaning robot
CN113110457B (en) Autonomous coverage inspection method for intelligent robot in indoor complex dynamic environment
US20070276709A1 (en) Pathfinding System
Jaklin et al. Real‐time path planning in heterogeneous environments
CN109855616B (en) Multi-sensor robot navigation method based on virtual environment and reinforcement learning
CN113962390B (en) Method for constructing diversified search strategy model based on deep reinforcement learning network
CN114895691B (en) Path planning method and device of swimming pool cleaning robot
CN107728608A (en) A kind of method for planning path for mobile robot
CN112237403B (en) Covering path generation method for cleaning device and cleaning device
CN114077807A (en) Computer implementation method and equipment for controlling mobile robot based on semantic environment diagram
Jiang et al. iTD3-CLN: Learn to navigate in dynamic scene through Deep Reinforcement Learning
CN114415657A (en) Cleaning robot wall-following method based on deep reinforcement learning and cleaning robot
CN116764225A (en) Efficient path-finding processing method, device, equipment and medium
Epstein et al. Navigation with Learned Spatial Affordances.
CN113624230B (en) Navigation path generation method for mobile robot and mobile robot
Vitek et al. Intelligent agents in games: Review with an open-source tool
Revell et al. Sim2real: Issues in transferring autonomous driving model from simulation to real world
Ahmed et al. Using Compression to Discover Interesting Behaviours in a Hybrid Braitenberg Vehicle
Drago et al. From graphs to Euclidean virtual worlds: Visualization of 3D electronic institutions
Balac et al. Learning action models for navigation in noisy environments
Park et al. Recombinant rule selection in evolutionary algorithm for fuzzy path planner of robot soccer
Lin et al. Sensor-deployment strategies for indoor robot navigation
Jaklin On weighted regions and social crowds: autonomous-agent navigation in virtual worlds
Picardi A comparison of Different Machine Learning Techniques to Develop the AI of a Virtual Racing Game

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination