CN112987713A - Control method and device for automatic driving equipment and storage medium


Info

Publication number
CN112987713A
Authority
CN
China
Prior art keywords
state sequence
neural network
network model
probability distribution
determining
Prior art date
Legal status
Pending
Application number
CN201911298066.2A
Other languages
Chinese (zh)
Inventor
黄萱昆
浦世亮
熊江
谢迪
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201911298066.2A
Publication of CN112987713A

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0231 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0242 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using non-visible light signals, e.g. IR or UV signals
    • G05D1/0246 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0257 Control of position or course in two dimensions specially adapted to land vehicles using a radar

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Electromagnetism (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Feedback Control In General (AREA)

Abstract

The application discloses a control method and apparatus for an automatic driving device and a storage medium, and belongs to the technical field of intelligent devices. The method comprises the following steps: determining a first state sequence, wherein the first state sequence comprises global environment information; determining a position probability distribution of a current target point through a first neural network model based on the first state sequence; determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences corresponding to the plurality of moments and the position probability distribution, and controlling the automatic driving device to travel based on the action spaces corresponding to the plurality of moments. The second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving device at that moment, and the action space corresponding to each moment indicates the action to be executed at that moment. The method and apparatus can avoid congestion among multiple automatic driving devices.

Description

Control method and device for automatic driving equipment and storage medium
Technical Field
The present disclosure relates to the field of intelligent devices, and in particular, to a method and an apparatus for controlling an automatic driving device, and a storage medium.
Background
Currently, automatic driving devices such as AGVs (Automated Guided Vehicles) are widely used in scenarios such as automated warehousing systems, where AGVs can be used to automatically transport goods without manual control.
Typically, the automatic driving device may employ a conventional path planning algorithm such as A* or D* to determine a travel path, and the transport of goods is then completed automatically along that path. However, when the number of automatic driving devices in the environment increases or the environment changes dynamically, the conventional path planning algorithm computes very slowly, which is unfavorable for real-time task planning, so congestion among multiple automatic driving devices easily occurs.
Disclosure of Invention
The application provides a control method and apparatus for an automatic driving device and a storage medium, which can solve the problem in the related art that multiple automatic driving devices are prone to congestion when a conventional path planning algorithm is adopted. The technical solution is as follows:
in one aspect, there is provided a control method of an automatic driving apparatus, the method including:
determining a first state sequence, the first state sequence comprising global environment information;
determining a position probability distribution through a first neural network model based on the first state sequence, wherein the position probability distribution is used for determining a current target point;
determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences corresponding to the moments and the position probability distribution, and controlling the automatic driving equipment to run based on the action spaces corresponding to the moments so as to enable the automatic driving equipment to move to the current target point;
the second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
In another aspect, there is provided a control apparatus of an automatic driving device, the apparatus including:
a first determining module, configured to determine a first state sequence, where the first state sequence includes global environment information;
the second determination module is used for determining the position probability distribution of the current target point through the first neural network model based on the first state sequence;
a third determining module, configured to determine, based on a second state sequence and the position probability distribution corresponding to multiple times, motion spaces corresponding to the multiple times through a second neural network model, and control the automatic driving device to travel based on the motion spaces corresponding to the multiple times, so that the automatic driving device moves to the current target point;
the second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
In another aspect, an autopilot device is provided that includes a detection sensor, a travel component, a processor, and a transceiver:
the detection sensor is used for detecting the environment to obtain local environment information;
the transceiver is used for receiving the global environment information;
the processor is used for determining the action which needs to be executed by the automatic driving equipment based on the local environment information detected by the detection sensor and the global environment information received by the transceiver, and controlling the travelling component to move according to the determined action.
As an example, the detection sensor is an image sensor, and the image sensor is configured to acquire an image of a surrounding environment as the local environment information.
As an example, the transceiver is configured to transmit the location information and/or the motion space of the device to other devices, and the transceiver is further configured to receive the location information and/or the motion space of other devices transmitted by other devices.
In another aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method of controlling an autopilot device.
In another aspect, a computer program product is provided comprising instructions which, when run on a computer, cause the computer to perform the steps of the control method of an autopilot device as described above.
The technical scheme provided by the application can at least bring the following beneficial effects:
and determining a first state sequence, wherein the first state sequence comprises global environment information, and determining the position probability distribution of the current target point through a first neural network model based on the first state sequence. And determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences and position probability distribution corresponding to the plurality of moments, wherein the action space corresponding to each moment is used for indicating an action to be executed at each moment, and controlling the automatic driving equipment to run based on the action spaces corresponding to the plurality of moments so as to enable the automatic driving equipment to move to the current target point. Wherein the second state sequence corresponding to each time comprises local environmental information around the position where the autonomous driving apparatus is located at each time. That is, the whole task is hierarchically processed through two independent neural network models to realize the control of the automatic driving equipment, so that the planning efficiency of the task can be improved, and the phenomenon that a plurality of automatic driving equipment are jammed is avoided.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic structural diagram of an autopilot apparatus provided in an embodiment of the present application;
fig. 2 is a flowchart of a control method of an automatic driving device according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a control device of an automatic driving apparatus according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Before describing the control method of the automatic driving device provided by the embodiment of the present application in detail, the application scenario and the implementation environment related to the embodiment of the present application are briefly described.
First, a brief description is given of an application scenario related to an embodiment of the present application.
At present, in an automated warehousing system, an automatic driving device can plan a path by adopting a conventional path planning algorithm without manual control, and then automatically complete tasks such as express parcel sorting and warehouse shelf carrying based on the planned path, thereby greatly reducing the labor cost of traditional warehousing logistics. Naturally, a planned path is better if it is short, consumes less time, and reasonably avoids obstacles and congestion.
In an application scenario with multiple automatic driving devices, it is difficult for a conventional path planning algorithm to comprehensively account for space-time conflicts and mutual influences among paths at the planning level, so a globally optimal solution is hard to obtain.
With the rapid development of artificial intelligence technology in recent years, deep reinforcement learning has become one of the most closely watched research directions in the field of artificial intelligence and has achieved many good results in fields such as games and robot control. In deep reinforcement learning, a neural network model may be used to model the environment: a state sequence (which may also be referred to as a state space) corresponding to environment information is acquired, an action space is determined based on the state sequence, and the action indicated by the action space is executed. At the same time, the environment feeds back on each action of the agent (e.g., the automatic driving device in this application); by setting an objective function based on the cumulative reward, the expected future return that the agent can obtain in the current state is maximized, which helps the agent take better behaviors and actions in each state.
However, when existing deep reinforcement learning is used to implement path planning for multiple automatic driving devices, an obvious drawback is that a single, monolithic neural network model leaves the input information insufficiently refined, so learning efficiency is not high. In addition, in an environment containing multiple automatic driving devices, a single search pattern in a high-dimensional state space slows down the convergence of deep reinforcement learning, and good performance is difficult to obtain. Moreover, in actual tasks, the performance of deep reinforcement learning is affected by factors such as failures of automatic driving devices and changes in the environment, and its adaptability and generalization capability are poor. To this end, embodiments of the present application provide a control method for an automatic driving device that can overcome the above drawbacks; specific implementations may be found below.
Next, a brief description is given of an implementation environment related to the embodiments of the present application.
The implementation environment related to the embodiment of the application may include a plurality of automatic driving devices, and the method may be executed by any one of the plurality of automatic driving devices. As an example, the plurality of automatic driving devices may communicate with each other. In some embodiments, the automatic driving device may also be referred to as an automated guided vehicle (AGV), a smart mobile device, or a smart robot.
As an example, referring to fig. 1, each automatic driving device may be configured with a detection sensor 110, such as a laser radar, a millimeter wave radar, or an infrared sensor, to acquire surrounding environment information through the detection sensor. In addition, each automatic driving device further includes a travel member 120, which may be a wheel or the like, and a processor 130 for determining an action to be performed based on the environment information detected by the detection sensor 110 and controlling the travel member 120 to move based on that action. Further, each automatic driving device may include a transceiver 140. The transceiver 140 may be configured to receive global environment information, to transmit information to other devices, such as its own location information or its own action space, and to receive information transmitted by other devices, such as location information or action spaces transmitted by other automatic driving devices. Also, each automatic driving device may further include an angle sensor and a speed sensor, and may detect a current angle through the angle sensor and a current speed through the speed sensor.
Further, the implementation environment may also include a central management system. The central management system may communicate with each of the plurality of automatic driving devices and may be configured to schedule them. For example, each automatic driving device may report its current status data to the central management system, so that the central management system distributes the data to the other automatic driving devices.
After the application scenarios and implementation environments related to the embodiments of the present application are described, a detailed description will be given of a control method of an automatic driving apparatus according to the embodiments of the present application with reference to the drawings.
Referring to fig. 2, fig. 2 is a flowchart illustrating a control method for an automatic driving device according to an exemplary embodiment, where the method may be applied to any of the automatic driving devices, and the method may include the following implementation steps:
step 201: a first sequence of states is determined, the first sequence of states including global context information.
Wherein the global environment information may be understood as coarse-grained, global information about the environment in which the automatic driving device is located. As an example, the global environment information includes at least one of a global map size, position information of a workbench, area information of an area where a shelf is located, position information of fixed obstacles, and current position information of each automatic driving device in the environment.
For example, when the automatic driving device is located in a warehouse, the global map size may refer to the size of the warehouse, such as 10 × 10 meters.
The position information of the worktable can be used for indicating the actual position of the worktable, such as longitude and latitude information, position coordinates and the like.
The area information of the area where the shelf is located may be used to indicate the area where the shelf is located, and for example, may include position information of four vertices of the area where the shelf is located.
The fixed obstacle may refer to an obstacle that is not easily moved, such as a wall or a pillar, and similarly, the position information of the fixed obstacle is used to indicate the actual position of the fixed obstacle.
In addition, since the environment typically includes a plurality of autonomous devices, the global environment information may also typically include current location information of the respective autonomous devices in the environment.
It is understood that, among the above information, the current position information of each automatic driving device changes dynamically, while the other information is static, that is, it generally does not change.
The global environment information may be obtained in advance. As an example, as previously described, a central management system may typically be included in the implementation environment, in which case the global environment information may be sent by the central management system to the autonomous device.
As another example, the static information in the global environment information may be pre-stored in each automatic driving device, and the current location information of the other automatic driving devices may be sent to the automatic driving device either directly or through an intermediate forwarding device. For example, another automatic driving device that is close by may send its current location information directly, or it may report its current location information to an intermediate forwarding device, which forwards the location information to automatic driving devices that are farther away according to the locations of the respective devices. It is understood that if the other automatic driving device is close to the automatic driving device, the location information may be obtained directly, and if it is farther away, the location information may be obtained from the intermediate forwarding device. For example, the intermediate forwarding device may be a device in the central management system, which is not limited in the embodiments of the present application.
Here, the global environment information is represented as discretized state values, resulting in a first state sequence.
As an example, the first state sequence is determined periodically, and accordingly, step 201 includes: determining a first state sequence for the current period. That is, the automatic driving device may periodically determine the first state sequence, and the duration of each period may be a random value; that is, each time a random sample may be drawn within a certain threshold range to obtain the duration of one period. The threshold range may be set according to actual requirements; assuming that the duration of the current period is denoted T, T may be randomly selected within the threshold range [Tmin, Tmax]. Determining the period duration by random sampling within a threshold range allows the subsequent first neural network model to better adapt to complex, dynamic environments and improves robustness and generalization capability.
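To make the encoding concrete, the following Python sketch shows one possible way to discretize the global environment information into a first state sequence and to randomly sample the period duration from [Tmin, Tmax]. It is illustrative only; the grid layout, channel assignment, function names, and threshold values are assumptions, not part of the original disclosure.

```python
import random
import numpy as np

def build_first_state_sequence(map_size, shelf_areas, fixed_obstacles,
                               workbenches, agv_positions):
    """Encode the global environment information as discretized state values.

    Hypothetical layout: one channel per information type on a coarse grid.
    """
    h, w = map_size
    state = np.zeros((4, h, w), dtype=np.float32)
    for (r0, c0, r1, c1) in shelf_areas:          # areas where shelves are located
        state[0, r0:r1 + 1, c0:c1 + 1] = 1.0
    for (r, c) in fixed_obstacles:                # walls, pillars, ...
        state[1, r, c] = 1.0
    for (r, c) in workbenches:                    # workbench positions
        state[2, r, c] = 1.0
    for (r, c) in agv_positions:                  # current positions of all AGVs
        state[3, r, c] = 1.0
    return state.flatten()                        # the first state sequence

def sample_period_duration(t_min=5.0, t_max=15.0):
    """Randomly select the current period duration T within [T_min, T_max]."""
    return random.uniform(t_min, t_max)
```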
Step 202: based on the first state sequence, a position probability distribution of the current target point is determined by a first neural network model.
As an example, the current target point refers to a point to be reached at the end of the current period; that is, the current target point is actually a sub-target point rather than the final end point.
In the embodiment of the present application, two separate neural network models are used to complete the task of path planning, that is, the task is hierarchically processed, and in implementation, different subtasks of one task may be respectively allocated to the first neural network model and the second neural network model to implement the task. The first neural network model and the second neural network model are two independent neural network models, and they can be both neural network models with deep reinforcement learning ability.
Here, based on the first state sequence obtained above, a current target point to be reached by the automatic driving device at the end of the current cycle is determined through the first neural network model, and the current target point is a point on a path from the starting point to the end point. In an implementation, the first state sequence may be used as an input to the first neural network model, outputting a location probability distribution that is actually used to indicate a location point in the corresponding environment.
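As a minimal sketch, and only as an assumption rather than the patent's concrete network structure, the policy part of the first neural network model could be a small fully connected network that maps the first state sequence to a softmax distribution over grid cells of the global map, with the most probable cell taken as the current target point:

```python
import torch
import torch.nn as nn

class FirstPolicyNetwork(nn.Module):
    """Maps the first state sequence to a probability distribution over
    candidate target-point positions (here: cells of the global grid)."""

    def __init__(self, state_dim, num_cells, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_cells),
        )

    def forward(self, first_state_sequence):
        logits = self.net(first_state_sequence)
        return torch.softmax(logits, dim=-1)   # the position probability distribution

# Example: pick the most probable cell as the current target point.
# probs = FirstPolicyNetwork(state_dim=400, num_cells=100)(torch.randn(1, 400))
# current_target_cell = int(torch.argmax(probs, dim=-1))
```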
Step 203: and determining motion spaces corresponding to the multiple moments through a second neural network model based on second state sequences corresponding to the multiple moments and the position probability distribution, and controlling the automatic driving equipment to run based on the motion spaces corresponding to the multiple moments so as to enable the automatic driving equipment to move to the current target point.
The second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
The local environment information may be understood as more detailed environment information in a small area around the automatic driving device; that is, unlike the global environment information, the local environment information only covers a local area, but within that area it is more comprehensive and detailed.
As an example, the local environment information includes position information of an obstacle within a target local area, which is an area that can be detected by a detection sensor of the automatic driving apparatus, position information of other automatic driving apparatuses, and at least one of start point position information, end point position information, a current angle, and a current speed of the automatic driving apparatus.
The start point position information and the end point position information of the automatic driving device may be preset, for example, specified by the central management system or specified by a user.
The current angle can be detected by an angle sensor, and the current speed can be detected by a speed sensor.
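As an illustrative sketch, and assuming an occupancy-grid encoding and field names that are not specified in the patent, the local environment information at one moment could be packed into a second state sequence as follows:

```python
import numpy as np

def build_second_state_sequence(local_obstacles, nearby_agvs, start_pos, end_pos,
                                current_angle, current_speed, sensor_range=5):
    """Encode the local environment information around the AGV at one moment.

    Hypothetical encoding: a small occupancy grid centred on the AGV plus a
    vector of scalar state values.
    """
    size = 2 * sensor_range + 1
    grid = np.zeros((2, size, size), dtype=np.float32)
    for (dr, dc) in local_obstacles:       # obstacle offsets relative to the AGV
        grid[0, dr + sensor_range, dc + sensor_range] = 1.0
    for (dr, dc) in nearby_agvs:           # other AGVs inside the detectable area
        grid[1, dr + sensor_range, dc + sensor_range] = 1.0
    scalars = np.array([*start_pos, *end_pos, current_angle, current_speed],
                       dtype=np.float32)
    return np.concatenate([grid.flatten(), scalars])   # the second state sequence
```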
The action indicated by the action space can be any one of moving forward N steps, turning left, turning right, stopping, moving backward M steps, and the like, wherein N and M are integers greater than 1.
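For illustration only, such a discrete action set could be enumerated as below; the indices and step counts are assumptions:

```python
# Illustrative discrete action set; the indices and step counts are assumptions.
ACTIONS = {
    0: ("forward", 2),     # move forward N steps (here N = 2)
    1: ("turn_left", None),
    2: ("turn_right", None),
    3: ("stop", None),
    4: ("backward", 2),    # move backward M steps (here M = 2)
}
```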
In the embodiment of the present application, the second neural network model makes a plurality of decisions within one cycle, that is, within the current cycle, motion spaces corresponding to a plurality of time instants are determined, so that based on the motion spaces corresponding to the plurality of time instants, it is determined what actions are respectively performed at the plurality of time instants to reach the current target point determined by the first neural network model.
As an example, determining, by the second neural network model, a specific implementation of the motion space corresponding to the multiple time instants based on the second state sequence corresponding to the multiple time instants in the current cycle and the position probability distribution may include: and in the current period, acquiring a second state sequence every specified time, and determining an action space corresponding to the current acquisition time through the second neural network model based on the acquired second state sequence and the position probability distribution every time the second state sequence is acquired.
The specified duration may be set according to actual needs of a user, or may be set by default by the automatic driving device, which is not limited in the embodiment of the present application.
After the first neural network model makes one decision and outputs the position probability distribution, the second neural network model makes a decision every specified duration, and within the time from the current time t to time t + T (namely the current period) it learns the actions that the automatic driving device needs to execute to move from its current position to the current target point determined by the first neural network model.
The process is described next with a specific example. Assuming that the specified duration is one time step, the automatic driving device acquires the surrounding local environment information at time t and generates a series of discrete state values to obtain the second state sequence corresponding to time t. The second state sequence corresponding to time t and the position probability distribution are input into the second neural network model, which outputs the action space corresponding to time t; the action space corresponding to time t indicates the action to be performed, and the automatic driving device performs that action. The automatic driving device then acquires the surrounding local environment information at time t + 1, generates a series of discrete state values to obtain the second state sequence corresponding to time t + 1, inputs it into the second neural network model, determines the action space corresponding to time t + 1 based on that second state sequence and the position probability distribution, and performs the indicated action. The process continues in this manner until the action space corresponding to time t + T is determined based on the second state sequence corresponding to time t + T and the indicated action is performed, at which point the second neural network model has completed the decisions and the subtask of the current period.
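A minimal control-loop sketch of this per-step decision process is shown below; agv, sense_local_environment, execute, and second_policy are hypothetical placeholders rather than names from the patent, and the position probability distribution is simply passed alongside the second state sequence:

```python
import torch

def run_current_period(agv, second_policy, position_probs, period_steps):
    """Query the second neural network model once per time step within the
    current period and execute the indicated action."""
    trajectory = []
    for _ in range(period_steps):                                      # times t, t+1, ..., t+T
        local_state = torch.as_tensor(agv.sense_local_environment())   # second state sequence
        action_probs = second_policy(local_state, position_probs)      # action space (output probabilities)
        action = int(torch.argmax(action_probs))                       # or sample from the distribution
        agv.execute(action)                                            # e.g. forward / turn / stop / backward
        trajectory.append((local_state, action))
    return trajectory
```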
The action space may be represented by output probabilities; that is, there may be a mapping relationship between actions and probabilities, and the action to be executed can be determined from the output probabilities.
In this way, the whole task is effectively divided: the tasks corresponding to the global environment information and the local environment information are decoupled, the first neural network model focuses on a small amount of global information while the second neural network model focuses on the more refined local field-of-view information, different states are learned in a targeted manner, and effective learning in a complex environment is ensured.
Further, an internal reward value obtained by performing the action indicated by the corresponding action space is determined by the second neural network model based on the second state sequence acquired at each time instant, and the position probability distribution, the second state sequence corresponding to each time instant, the action space, and the internal reward value are correspondingly stored.
The internal reward value corresponding to each time can be a positive number or a negative number, and when the internal reward value is a negative value, a certain penalty is actually given.
That is, after the automatic driving device executes the motion space corresponding to each time, the internal reward value obtained by executing the motion indicated by the corresponding motion space based on the second state sequence acquired at each time may be determined by the second neural network model.
For example, if a stop motion is executed based on a motion space corresponding to a certain time, which indicates that there is a stop behavior during traveling to the current target point, a congestion phenomenon is likely to occur, and therefore a certain penalty can be given by the second neural network model, and at this time, the internal reward value corresponding to the time is a negative number. For another example, if a left-turn or right-turn motion is performed based on a motion space corresponding to a certain time, it means that a turn is required while traveling to the current target point, which also means that there is a high possibility of congestion, and therefore a certain penalty may be given by the second neural network model. For example, if a straight-ahead motion is performed based on a motion space corresponding to a certain time, it is described that there is no congestion at that time, and therefore a certain reward can be given by the second neural network model, and the internal reward value corresponding to that time is a positive number.
It should be noted that the internal reward value given by each action may be set according to actual requirements, which is not limited in the embodiment of the present application.
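To make the reward shaping tangible, here is a small sketch of such an internal reward function; the concrete values are placeholders, since, as noted above, they may be set according to actual requirements:

```python
def internal_reward(action_name):
    """Illustrative internal reward shaping; the concrete values are placeholders."""
    if action_name == "stop":
        return -0.5                        # penalty: stopping suggests possible congestion
    if action_name in ("turn_left", "turn_right"):
        return -0.2                        # smaller penalty for turning
    if action_name == "forward":
        return 0.1                         # small reward for straight-ahead progress
    return 0.0
```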
Further, after controlling the automatic driving device to travel based on the motion space corresponding to the plurality of times, the following operations may be performed: and determining the stored second state sequence, action space and internal reward value corresponding to each moment in the moments as a group of first training data to obtain the first training data corresponding to the moments, and updating the parameters of the second neural network model based on the first training data corresponding to the moments.
That is, as described above, the second state sequence, the motion space, and the internal prize value are associated with each time in one cycle, and when the subtasks in one cycle are completed, the automatic driving apparatus acquires these data at a plurality of times in the cycle as training data, i.e., first training data, and then updates the parameters of the second neural network model using the first training data.
As an example, the second neural network model includes a second policy network for outputting a corresponding action space based on the input second state sequence and the position probability distribution, and a second evaluation network for outputting a second state evaluation value based on the action space output by the second policy network.
That is, the second neural network model may include a second policy network and a second evaluation network, the second policy network may be used to determine the motion space, such as output as an output probability representing the motion space, and the second evaluation network may evaluate the performed motion and output a second state evaluation value according to the internal reward value corresponding to each time, the second state evaluation value may be used to indicate whether the performed motion is effective. In the updating process, the first training data may be input into the second neural network model, in which case, parameters of the second policy network and the second evaluation network need to be updated respectively, so as to update the second neural network model. As an example, the update may be performed as follows.
Wherein the parameters of the second policy network included in the second neural network model can be updated by the following formula (1):
θ_1 ← θ_1 + α_1 ∇_{θ_1} log π_{θ_1}(a_t | S_t) · (r_t + γ_1 V(S′_t) − V(S_t))  (1)
wherein θ_1 represents a parameter of the second policy network; α_1 represents the update step size, which can be set according to actual requirements; ∇_{θ_1} denotes the gradient with respect to θ_1; π_{θ_1}(a_t | S_t) represents the output probability (the action space) of selecting action a_t under the second state sequence S_t corresponding to time t; and V(S_t) represents the second state evaluation value output by the second evaluation network at time t.
Wherein the parameters of the second evaluation network included in the second neural network model can be updated by the following formula (2):
δ_1 = r_t + γ_1 V(S′_t) − V(S_t)  (2)
wherein δ_1 is the update term for the parameters of the second evaluation network; r_t represents the internal reward value corresponding to time t; γ_1 is a discount factor that can be set according to actual requirements; V(S_t) represents the second state evaluation value determined by the second evaluation network at time t; and V(S′_t) represents the second state evaluation value determined by the second evaluation network at time t + 1.
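For illustration, a minimal PyTorch sketch of one such actor-critic update step follows. It assumes the policy and evaluation networks are callables returning action probabilities and a scalar state value for a single (unbatched) state, and it folds the position probability distribution into the state input for brevity; it is a generic TD-error-weighted policy-gradient step in the spirit of formulas (1) and (2), not the patent's exact training code:

```python
import torch

def update_second_network(policy_net, value_net, state, next_state, action,
                          reward, alpha1=1e-3, gamma1=0.99):
    """One illustrative actor-critic step in the spirit of formulas (1) and (2)."""
    policy_opt = torch.optim.SGD(policy_net.parameters(), lr=alpha1)
    value_opt = torch.optim.SGD(value_net.parameters(), lr=alpha1)

    # Formula (2): TD error  delta_1 = r_t + gamma_1 * V(S'_t) - V(S_t)
    delta = reward + gamma1 * value_net(next_state).detach() - value_net(state)

    # Formula (1): policy-gradient step weighted by the TD error
    log_prob = torch.log(policy_net(state)[action])
    policy_loss = -delta.detach() * log_prob       # gradient ascent on the objective
    value_loss = delta.pow(2)                      # fit V(S_t) toward the TD target

    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    value_opt.zero_grad()
    value_loss.backward()
    value_opt.step()
```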
Further, after the internal reward value obtained by performing the action indicated by the corresponding action space is determined by the second neural network model based on the second state sequence acquired at each time instant, the following operations may also be performed: summing the internal reward values corresponding to the moments in the current period to obtain the internal total reward value of the current period; meanwhile, an external reward value is obtained by moving to the current target point indicated by the position probability distribution under the first state sequence. A high-level total reward value is then determined from the internal total reward value and the external reward value, and the first state sequence, the position probability distribution, and the high-level total reward value of the current period are correspondingly stored.
That is, for the first neural network model, after performing a decision, an external reward value for external environmental feedback is determined, where the external reward value may be zero. And, in addition, the second neural network model feeds back an internal total reward value to the first neural network model, which may be determined by the following equation (3), as an example:
r_in = Σ_t (r_t^a + r_t^b)  (3)
wherein the summation is over the moments of the current period, r_t^a represents the penalty given when the second neural network model decides a stop action command for the automatic driving device at time t, and r_t^b represents the penalty given when the second neural network model decides a left-turn or right-turn action command at time t.
The first neural network model may then determine a high level total reward value for the current period based on the external reward value and the internal total reward value. And then correspondingly storing the first state sequence, the position probability distribution and the high-layer total reward value of the current period.
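A small sketch of this bookkeeping is given below; the way the external and internal rewards are combined (a plain weighted sum) is an assumption:

```python
def internal_total_reward(step_penalties_a, step_penalties_b):
    """Formula (3)-style sum over the current period: sum_t (r_t^a + r_t^b)."""
    return sum(ra + rb for ra, rb in zip(step_penalties_a, step_penalties_b))

def high_level_total_reward(external_reward, internal_total, weight=1.0):
    """Combine the external reward value with the internal total reward value;
    the combination rule (a weighted sum) is an assumption."""
    return external_reward + weight * internal_total
```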
Thus, the external reward value from the external environment is typically sparse: for example, it may only be obtained when the automatic driving device reaches the end point, for instance as a function of the length or duration of the travel path, so for a given period the external reward value may be zero, which is not conducive to subsequent updates of the first neural network model. Therefore, the second neural network model feeds back an internal reward value to the first neural network model, which facilitates subsequent updating of the first neural network model, allows it to converge quickly and effectively learn a good strategy, and guides it to select current target points with less congestion as far as possible, thereby avoiding congestion among multiple automatic driving devices.
Further, second training data corresponding to a plurality of periods are obtained, the second training data corresponding to each period comprise the first state sequence, the position probability distribution and the high-level total reward value of each period, and the parameters of the first neural network model are updated based on the second training data corresponding to the plurality of periods.
As described above, since the first state sequence, the position probability distribution, and the total high-level reward value are stored in advance for each period, in the path planning process, the automatic driving device acquires the first state sequence, the position probability distribution, and the total high-level reward value for a plurality of periods, for example, when the automatic driving device travels to an end point, the data of a plurality of periods may be acquired as training data, that is, second training data, and then the second training data may be input to the first neural network model to update the parameters of the first neural network model.
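The period-level data collection and update could look like the following sketch; stored_periods, the record keys, and update_step are hypothetical names, where update_step would perform the period-level counterpart of formulas (4) and (5) below:

```python
def update_first_model_from_periods(stored_periods, first_policy, first_value, update_step):
    """Replay the stored per-period records (the second training data) and apply
    one update per period; update_step is a hypothetical function performing the
    period-level counterpart of formulas (4) and (5)."""
    for record in stored_periods:
        update_step(first_policy, first_value,
                    record["first_state_sequence"],
                    record["position_probs"],
                    record["high_level_reward"])
```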
As an example, the first neural network model includes a first policy network for outputting the location probability distribution based on the input first state sequence and a first evaluation network for outputting a first state evaluation value based on the location probability distribution.
That is, the first neural network model may include a first policy network and a first evaluation network, in which case, parameters of the first policy network and the first evaluation network need to be updated, respectively, so as to update the first neural network model. As an example, the update may be performed as follows.
Wherein the parameters of the first policy network included in the first neural network model can be updated according to the following formula (4):
θ_2 ← θ_2 + α_2 ∇_{θ_2} log π_{θ_2}(a_T | S_T) · (r_T + γ_2 V(S′_T) − V(S_T))  (4)
wherein θ_2 represents a parameter of the first policy network; α_2 represents the update step size, which can be set according to actual requirements; ∇_{θ_2} denotes the gradient with respect to θ_2; π_{θ_2}(a_T | S_T) represents the output probability (the position probability distribution) of selecting action a_T under the first state sequence S_T corresponding to period T; and V(S_T) represents the first state evaluation value output by the first evaluation network in period T.
Wherein the parameters of the first evaluation network included in the first neural network model can be updated by the following formula (5):
δ_2 = r_T + γ_2 V(S′_T) − V(S_T)  (5)
wherein δ_2 is the update term for the parameters of the first evaluation network; r_T represents the high-level total reward value corresponding to period T; γ_2 is a discount factor that can be set according to actual requirements; V(S_T) represents the first state evaluation value determined by the first evaluation network in period T; and V(S′_T) represents the first state evaluation value determined by the first evaluation network in period T + 1.
As an example, when the number of updates to the first neural network model reaches a first number threshold, the first neural network model may stop being updated; likewise, when the number of updates to the second neural network model reaches a second number threshold, the second neural network model may stop being updated.
The first number threshold can be set according to actual requirements, and similarly, the second number threshold can also be set according to actual requirements, which is not limited in the embodiments of the present application.
In the embodiment of the application, a first state sequence is determined, the first state sequence comprises global environment information, and based on the first state sequence, a position probability distribution of a current target point is determined through a first neural network model. And determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences and position probability distribution corresponding to the moments, wherein the action space corresponding to each moment is used for indicating an action to be executed at each moment, and controlling the automatic driving equipment to run based on the action spaces corresponding to the moments so as to enable the automatic driving equipment to move to the current target point. Wherein the second state sequence corresponding to each time comprises local environmental information around the position where the autonomous driving apparatus is located at each time. That is, the whole task is hierarchically processed through two independent neural network models to realize the control of the automatic driving equipment, so that the planning efficiency of the task can be improved, and the phenomenon that a plurality of automatic driving equipment are jammed is avoided.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a control device of an automatic driving apparatus according to an embodiment of the present application, where the control device may include:
a first determining module 310, configured to determine a first state sequence, where the first state sequence includes global environment information;
a second determining module 320, configured to determine, based on the first state sequence, a position probability distribution of a current target point through a first neural network model;
a third determining module 330, configured to determine, based on the second state sequences corresponding to multiple times and the position probability distribution, motion spaces corresponding to the multiple times through a second neural network model, and control the automatic driving device to travel based on the motion spaces corresponding to the multiple times, so that the automatic driving device moves to the current target point;
the second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
In a possible implementation manner of the present application, the third determining module 330 is configured to:
in the current period, acquiring a second state sequence every specified time length;
and when the second state sequence is acquired every time, determining an action space corresponding to the acquisition time of this time through the second neural network model based on the second state sequence acquired this time and the position probability distribution.
In a possible implementation manner of the present application, the third determining module 330 is further configured to:
determining, by the second neural network model, an internal reward value obtained by performing a corresponding action indicated by the action space based on the second state sequence obtained at each time instant;
determining a first state evaluation value corresponding to each time through the second neural network model based on the internal reward value corresponding to each time;
and correspondingly storing the action space, the internal reward value and the first state evaluation value corresponding to each moment.
In a possible implementation manner of the present application, the third determining module 330 is further configured to:
determining the stored motion space, the internal reward value and the first state evaluation value corresponding to each moment in the plurality of moments as a group of first training data to obtain the first training data corresponding to the plurality of moments;
and updating the parameters of the second neural network model based on the first training data corresponding to the plurality of moments.
In one possible implementation manner of the present application, the second determining module 320 is further configured to:
summing the internal reward values corresponding to each moment in the multiple moments to obtain the internal total reward value of the current period;
feeding back the internal total reward value to the first neural network model;
determining, by the first neural network model, an external reward value obtained by moving to a current target point indicated by the location probability distribution under the first state sequence;
determining a high-level total reward value according to the internal total reward value and the external reward value;
determining, by the first neural network model, a second state assessment value based on the high tier total reward value;
and correspondingly storing the position probability distribution, the high-level total reward value and the second state evaluation value of the current period.
In one possible implementation manner of the present application, the second determining module 320 is further configured to:
acquiring second training data corresponding to a plurality of periods, wherein the second training data corresponding to each period comprises position probability distribution, a high-level total reward value and a second state evaluation value of each period;
and updating the parameters of the first neural network model based on second training data corresponding to the plurality of periods.
In one possible implementation of the present application,
the first neural network model includes a first policy network for outputting the position probability distribution based on an input first state sequence, and a first evaluation network for outputting a first state evaluation value based on the position probability distribution; and/or the presence of a gas in the gas,
the second neural network model includes a second policy network for outputting a corresponding action space based on the input second state sequence and the position probability distribution, and a second evaluation network for outputting a second state evaluation value based on the action space output by the second policy network.
In one possible implementation manner of the present application, the global environment information includes at least one of a global map size, position information of a workbench, area information of an area where a shelf is located, position information of a fixed obstacle, and current position information of each autonomous driving apparatus in an environment;
the local environment information includes position information of an obstacle in a target local area, position information of other autonomous driving apparatuses, and start point position information, end point position information, a current angle, and a current speed of the autonomous driving apparatuses, and the target local area is at least one of areas that can be detected by a detection sensor of the autonomous driving apparatuses.
In the embodiment of the application, a first state sequence is determined, the first state sequence comprises global environment information, and based on the first state sequence, a position probability distribution of a current target point is determined through a first neural network model. And determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences and position probability distribution corresponding to the moments, wherein the action space corresponding to each moment is used for indicating an action to be executed at each moment, and controlling the automatic driving equipment to run based on the action spaces corresponding to the moments so as to enable the automatic driving equipment to move to the current target point. Wherein the second state sequence corresponding to each time comprises local environmental information around the position where the autonomous driving apparatus is located at each time. That is, the whole task is hierarchically processed through two independent neural network models to realize the control of the automatic driving equipment, so that the planning efficiency of the task can be improved, and the phenomenon that a plurality of automatic driving equipment are jammed is avoided.
It should be noted that: the control device of the automatic driving device provided in the above embodiment is only exemplified by the division of the above function modules when controlling the automatic driving device to run, and in practical applications, the above function distribution may be completed by different function modules according to needs, that is, the internal structure of the device may be divided into different function modules to complete all or part of the above described functions. In addition, the control device of the automatic driving device provided by the above embodiment and the control embodiment of the automatic driving device belong to the same concept, and the specific implementation process thereof is described in detail in the method embodiment and is not described herein again.
In some embodiments, a computer-readable storage medium is also provided, in which a computer program is stored, which, when being executed by a processor, carries out the steps of the control method of an autopilot device according to an embodiment described above. For example, the computer readable storage medium may be a ROM, a RAM, a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It is noted that the computer-readable storage medium referred to herein may be a non-volatile storage medium, in other words, a non-transitory storage medium.
It should be understood that all or part of the steps for implementing the above embodiments may be implemented by software, hardware, firmware or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The computer instructions may be stored in the computer-readable storage medium described above.
That is, in some embodiments, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the steps of the control method of an autopilot device described above.
The above-mentioned embodiments are provided not to limit the present application, and any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (19)

1. A control method of an automatic driving apparatus, characterized by comprising:
determining a first state sequence, the first state sequence comprising global environment information;
determining the position probability distribution of the current target point through a first neural network model based on the first state sequence;
determining action spaces corresponding to a plurality of moments through a second neural network model based on second state sequences corresponding to the moments and the position probability distribution, and controlling the automatic driving equipment to run based on the action spaces corresponding to the moments so as to enable the automatic driving equipment to move to the current target point;
the second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
2. The method of claim 1, wherein determining, by a second neural network model, an action space corresponding to a plurality of time instants based on the second state sequences corresponding to the plurality of time instants and the position probability distribution comprises:
in the current period, acquiring a second state sequence every specified time length;
and when the second state sequence is acquired every time, determining an action space corresponding to the acquisition time of this time through the second neural network model based on the second state sequence acquired this time and the position probability distribution.
3. The method of claim 2, wherein the method further comprises:
determining, by the second neural network model, an internal reward value obtained by performing a corresponding action indicated by the action space based on the second state sequence obtained at each time instant;
and correspondingly storing the position probability distribution, the second state sequence corresponding to each moment, the action space and the internal reward value.
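One possible shape of the inner loop described in claims 2 and 3 is sketched below: a second state sequence is acquired at a fixed interval, acted on, and stored together with its internal reward. The `internal_reward` method, the buffer layout, and the timing constants are assumptions for illustration.

```python
import time

def low_level_period(env, low, target_dist, replay_buffer,
                     period_s=10.0, interval_s=0.5):
    """Acquire a local state every `interval_s` seconds during one period,
    act on it, and store (state, action, internal reward, target distribution)."""
    t_end = time.time() + period_s
    internal_rewards = []
    while time.time() < t_end:
        second_state_seq = env.local_state()
        action = low.action(second_state_seq, target_dist)
        env.step(action)
        r_int = low.internal_reward(second_state_seq, action)   # assumed interface
        replay_buffer.append((second_state_seq, action, r_int, target_dist))
        internal_rewards.append(r_int)
        time.sleep(interval_s)
    return internal_rewards
```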
4. The method of claim 3, wherein after controlling the automatic driving device to travel based on the action spaces corresponding to the plurality of moments, the method further comprises:
determining the stored second state sequence, action space, internal reward value and the position probability distribution corresponding to each of the plurality of moments as a set of first training data, so as to obtain first training data corresponding to the plurality of moments;
and updating the parameters of the second neural network model based on the first training data corresponding to the plurality of moments.
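Claim 4 does not prescribe a particular learning algorithm. The sketch below assumes a REINFORCE-style update of the second neural network model from one period's stored tuples, with a hypothetical `log_prob` interface on the model; it is one possible realization, not the claimed method itself.

```python
import torch

def update_low_level(second_model, optimizer, first_training_data, gamma=0.99):
    """Update the low-level model from (state, action, internal reward,
    target distribution) tuples collected over a plurality of moments."""
    returns, g = [], 0.0
    for _, _, reward, _ in reversed(first_training_data):
        g = reward + gamma * g            # discounted return from this step onward
        returns.insert(0, g)
    loss = torch.zeros(())
    for (state, action, _, target_dist), g in zip(first_training_data, returns):
        log_prob = second_model.log_prob(state, target_dist, action)  # assumed interface
        loss = loss - log_prob * g        # REINFORCE-style surrogate loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```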
5. The method of claim 3, wherein after determining, through the second neural network model, the internal reward value obtained by performing the action indicated by the corresponding action space based on the second state sequence obtained at each moment, the method further comprises:
summing the internal reward values corresponding to the plurality of moments to obtain an internal total reward value of the current period;
feeding back the internal total reward value to the first neural network model;
determining, by the first neural network model, an external reward value obtained by moving to the current target point indicated by the position probability distribution under the first state sequence;
determining a high-level total reward value according to the internal total reward value and the external reward value;
and correspondingly storing the first state sequence, the position probability distribution and the high-level total reward value of the current period.
6. The method of claim 5, wherein the method further comprises:
acquiring second training data corresponding to a plurality of periods, wherein the second training data corresponding to each period comprises a first state sequence, position probability distribution and a high-level total reward value of each period;
and updating the parameters of the first neural network model based on second training data corresponding to the plurality of periods.
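The combination of internal and external rewards recited in claims 5 and 6 could, for example, be a weighted sum; the weights `alpha` and `beta` and the storage format below are illustrative assumptions rather than requirements of the claims.

```python
def high_level_total_reward(internal_rewards, external_reward,
                            alpha=1.0, beta=1.0):
    """Fold one period's internal rewards and the external reward into a
    single high-level total reward value."""
    return alpha * sum(internal_rewards) + beta * external_reward

def store_second_training_data(buffer, first_state_seq, target_dist,
                               internal_rewards, external_reward):
    """Keep (first state sequence, position distribution, high-level reward)
    for later updates of the first neural network model."""
    buffer.append((first_state_seq, target_dist,
                   high_level_total_reward(internal_rewards, external_reward)))
```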
7. The method of any one of claims 1-6,
the first neural network model includes a first policy network for outputting the position probability distribution based on an input first state sequence, and a first evaluation network for outputting a first state evaluation value based on the position probability distribution; and/or,
the second neural network model includes a second policy network for outputting a corresponding action space based on the input second state sequence and the position probability distribution, and a second evaluation network for outputting a second state evaluation value based on the action space output by the second policy network.
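One way to realize the policy/evaluation split of claim 7 is a pair of small feed-forward networks in which the evaluation network scores the policy output; the layer sizes and the softmax parameterization below are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class PolicyWithEvaluator(nn.Module):
    """Policy network producing a distribution, plus an evaluation network
    scoring that distribution, as described for the first neural network
    model in claim 7."""
    def __init__(self, state_dim, num_points, hidden=128):
        super().__init__()
        self.policy = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_points))
        self.evaluator = nn.Sequential(
            nn.Linear(num_points, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, state):
        dist = torch.softmax(self.policy(state), dim=-1)  # position probability distribution
        value = self.evaluator(dist)                      # first state evaluation value
        return dist, value
```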
8. The method of any one of claims 1-6,
the global environment information comprises at least one of: a global map size, position information of a workstation, area information of an area where shelves are located, position information of fixed obstacles, and current position information of each automatic driving device in the environment;
the local environment information includes position information of obstacles in a target local area, position information of other autonomous driving apparatuses, and start point position information, end point position information, a current angle, and a current speed of the autonomous driving apparatus, and the target local area is at least part of the area that can be detected by a detection sensor of the autonomous driving apparatus.
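The information items listed in claim 8 might be carried in simple containers such as the following; all field names and types are illustrative assumptions, not terms of the claim.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]          # (x, y) position in map coordinates

@dataclass
class GlobalEnvironmentInfo:
    map_size: Tuple[float, float]
    workstation_positions: List[Point] = field(default_factory=list)
    shelf_areas: List[Tuple[Point, Point]] = field(default_factory=list)  # bounding boxes
    fixed_obstacles: List[Point] = field(default_factory=list)
    vehicle_positions: List[Point] = field(default_factory=list)

@dataclass
class LocalEnvironmentInfo:
    obstacle_positions: List[Point]
    other_vehicle_positions: List[Point]
    start: Point
    goal: Point
    heading_rad: float
    speed_mps: float
```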
9. A control apparatus of an automatic driving device, characterized in that the apparatus comprises:
a first determining module, configured to determine a first state sequence, where the first state sequence includes global environment information;
a second determining module, configured to determine, through a first neural network model and based on the first state sequence, the position probability distribution of a current target point;
a third determining module, configured to determine, through a second neural network model, action spaces corresponding to a plurality of moments based on the second state sequences corresponding to the plurality of moments and the position probability distribution, and to control the automatic driving device to travel based on the action spaces corresponding to the plurality of moments, so that the automatic driving device moves to the current target point;
the second state sequence corresponding to each moment comprises local environment information around the position of the automatic driving equipment at each moment, and the action space corresponding to each moment is used for indicating the action to be executed at each moment.
10. The apparatus of claim 9, wherein the third determining module is configured to:
acquiring a second state sequence at every specified time interval within the current period;
and, each time a second state sequence is acquired, determining, through the second neural network model, the action space corresponding to the current acquisition time based on the currently acquired second state sequence and the position probability distribution.
11. The apparatus of claim 10, wherein the third determining module is further configured to:
determining, through the second neural network model and based on the second state sequence obtained at each moment, an internal reward value obtained by performing the action indicated by the corresponding action space;
and correspondingly storing the position probability distribution, the second state sequence corresponding to each moment, the action space and the internal reward value.
12. The apparatus of claim 11, wherein the third determining module is further configured to:
determining the stored second state sequence, action space, internal reward value and the position probability distribution corresponding to each of the plurality of moments as a set of first training data, so as to obtain first training data corresponding to the plurality of moments;
and updating the parameters of the second neural network model based on the first training data corresponding to the plurality of moments.
13. The apparatus of claim 11, wherein the second determining module is further configured to:
summing the internal reward values corresponding to the plurality of moments to obtain an internal total reward value of the current period;
feeding back the internal total reward value to the first neural network model;
determining, by the first neural network model, an external reward value obtained by moving to the current target point indicated by the position probability distribution under the first state sequence;
determining a high-level total reward value according to the internal total reward value and the external reward value;
and correspondingly storing the first state sequence, the position probability distribution and the high-level total reward value of the current period.
14. The apparatus of claim 13, wherein the second determining module is further configured to:
acquiring second training data corresponding to a plurality of periods, wherein the second training data corresponding to each period comprises a first state sequence, position probability distribution and a high-level total reward value of each period;
and updating the parameters of the first neural network model based on second training data corresponding to the plurality of periods.
15. The apparatus according to any one of claims 9-14,
the first neural network model includes a first policy network for outputting the position probability distribution based on an input first state sequence, and a first evaluation network for outputting a first state evaluation value based on the position probability distribution; and/or,
the second neural network model includes a second policy network for outputting a corresponding action space based on the input second state sequence and the position probability distribution, and a second evaluation network for outputting a second state evaluation value based on the action space output by the second policy network.
16. The apparatus of any one of claims 9-14,
the global environment information comprises at least one of: a global map size, position information of a workstation, area information of an area where shelves are located, position information of fixed obstacles, and current position information of each automatic driving device in the environment;
the local environment information includes position information of obstacles in a target local area, position information of other autonomous driving apparatuses, and start point position information, end point position information, a current angle, and a current speed of the autonomous driving apparatus, and the target local area is at least part of the area that can be detected by a detection sensor of the autonomous driving apparatus.
17. An autopilot device, comprising a detection sensor, a traveling component, a processor, and a transceiver, wherein:
the detection sensor is used for detecting the environment to obtain local environment information;
the transceiver is used for receiving the global environment information;
the processor is used for determining, based on the local environment information detected by the detection sensor and the global environment information received by the transceiver, the action to be executed by the autopilot device, and for controlling the traveling component to move according to the determined action.
18. The autopilot device of claim 17, wherein the detection sensor is an image sensor for capturing an image of the surroundings as the local environment information.
19. The autopilot device according to claim 17 or 18, wherein the transceiver is adapted to transmit position information and/or an action space of the device to other devices, and is further adapted to receive position information and/or action spaces transmitted by the other devices.
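The cooperation of the components recited in claims 17 to 19 could follow a loop like the one below; the `detect`, `receive`, `decide`, `execute`, and `send` interfaces are placeholders for illustration, not actual APIs of any particular platform.

```python
def device_control_loop(detection_sensor, transceiver, planner, traveling_component):
    """Fuse locally sensed information with globally received information,
    decide an action, and drive the traveling component."""
    while True:
        local_info = detection_sensor.detect()      # local environment information
        global_info = transceiver.receive()         # global environment information
        action = planner.decide(local_info, global_info)
        traveling_component.execute(action)
        transceiver.send(local_info, action)        # share position/action with other devices (claim 19)
```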
CN201911298066.2A 2019-12-17 2019-12-17 Control method and device for automatic driving equipment and storage medium Pending CN112987713A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911298066.2A CN112987713A (en) 2019-12-17 2019-12-17 Control method and device for automatic driving equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911298066.2A CN112987713A (en) 2019-12-17 2019-12-17 Control method and device for automatic driving equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112987713A true CN112987713A (en) 2021-06-18

Family

ID=76341771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911298066.2A Pending CN112987713A (en) 2019-12-17 2019-12-17 Control method and device for automatic driving equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112987713A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368076A (en) * 2017-07-31 2017-11-21 中南大学 Deep-learning-based control and planning method for robot motion paths in an intelligent environment
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 Heterogeneous multi-agent collaborative decision-making method based on deep deterministic policy gradient
CN109063827A (en) * 2018-10-25 2018-12-21 电子科技大学 Method, system, storage medium and terminal for automatically retrieving specific luggage in a confined space
CN110018820A (en) * 2019-04-08 2019-07-16 浙江大学滨海产业技术研究院 Graph2Seq-based method for automatically generating Java code comments through deep reinforcement learning
CN110543171A (en) * 2019-08-27 2019-12-06 华中科技大学 Warehouse multi-AGV path planning method based on an improved BP neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113537002A (en) * 2021-07-02 2021-10-22 安阳工学院 Driving environment evaluation method and device based on dual-mode neural network model
CN113537002B (en) * 2021-07-02 2023-01-24 安阳工学院 Driving environment evaluation method and device based on dual-mode neural network model
CN114089752A (en) * 2021-11-11 2022-02-25 深圳市杉川机器人有限公司 Autonomous exploration method for robot, and computer-readable storage medium

Similar Documents

Publication Publication Date Title
US11561544B2 (en) Indoor monocular navigation method based on cross-sensor transfer learning and system thereof
KR102043142B1 (en) Method and apparatus for learning artificial neural network for driving control of automated guided vehicle
CN110488843A (en) Barrier-avoiding method, mobile robot and computer readable storage medium
CN111604898B (en) Livestock retrieval method, robot, terminal equipment and storage medium
WO2022032443A1 (en) Transport method for multi-intelligent agent formation, system, and computer-readable storage medium
Zhao et al. Design and implementation of a multiple AGV scheduling algorithm for a job-shop.
CN112987713A (en) Control method and device for automatic driving equipment and storage medium
Li et al. Simulation analysis of a deep reinforcement learning approach for task selection by autonomous material handling vehicles
Chen et al. Target-driven obstacle avoidance algorithm based on DDPG for connected autonomous vehicles
Xue et al. Multi-agent deep reinforcement learning for UAVs navigation in unknown complex environment
CN116300909A (en) Robot obstacle avoidance navigation method based on information preprocessing and reinforcement learning
Gao et al. Asymmetric self-play-enabled intelligent heterogeneous multirobot catching system using deep multiagent reinforcement learning
Politi et al. Path planning and landing for unmanned aerial vehicles using ai
US20210398014A1 (en) Reinforcement learning based control of imitative policies for autonomous driving
CN117109574A (en) Agricultural transportation machinery coverage path planning method
Xia et al. A multi-AGV optimal scheduling algorithm based on particle swarm optimization
Chen et al. Path planning for multi-robot systems in intelligent warehouse
CN112034844A (en) Multi-intelligent-agent formation handling method, system and computer-readable storage medium
CN113741412B (en) Control method and device for automatic driving equipment and storage medium
Prabhu et al. Feasibility study of multi autonomous mobile robots (amrs) motion planning in smart warehouse environment
WO2022246802A1 (en) Driving strategy determination method and apparatus, device, and vehicle
Baxter et al. Shared Potential Fields and their place in a multi-robot co-ordination taxonomy
Mokhtari et al. Pedestrian collision avoidance for autonomous vehicles at unsignalized intersection using deep q-network
Yue et al. A new search scheme using multi‐bee‐colony elite learning method for unmanned aerial vehicles in unknown environments
CN113959446B (en) Autonomous logistics transportation navigation method for robot based on neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination