CN114167857A - Control method and device of unmanned equipment - Google Patents

Control method and device of unmanned equipment

Info

Publication number
CN114167857A
CN114167857A (application CN202111315547.7A)
Authority
CN
China
Prior art keywords
determining
lane
unmanned
environmental
obstacle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111315547.7A
Other languages
Chinese (zh)
Other versions
CN114167857B (en)
Inventor
熊方舟
夏华夏
任冬淳
丁曙光
樊明宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202111315547.7A priority Critical patent/CN114167857B/en
Publication of CN114167857A publication Critical patent/CN114167857A/en
Application granted granted Critical
Publication of CN114167857B publication Critical patent/CN114167857B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Traffic Control Systems (AREA)

Abstract

The specification discloses a control method and a control device for unmanned equipment, applicable to the technical field of unmanned driving. Environmental features of the environment in which the unmanned device is located are extracted and decoupled by a pre-trained autoencoder to obtain decoupled features; the decoupled features are input into a decision model trained in advance by reinforcement learning, which outputs a decision corresponding to the environmental features; and the unmanned device is controlled to move toward its destination according to the obtained decision. Based on the autoencoder, the method can accurately derive interpretable decoupled features from the environmental features of a scene, and the generalizable decision model can accurately output the decision the unmanned device should execute in a variety of scenes, improving the rationality and accuracy of decision-making.

Description

Control method and device of unmanned equipment
Technical Field
The specification relates to the technical field of unmanned driving, in particular to a control method and device of unmanned equipment.
Background
At present, in the technical field of unmanned driving, decision-making is key to controlling the safe movement of an unmanned device through its environment, and the accuracy of those decisions affects the device's safety.
In the prior art, environmental data are collected by the unmanned device, and a decision is determined from the collected data according to manually formulated rules. These rules enumerate the various environmental conditions the device may encounter during movement and the corresponding decisions.
However, the environment the unmanned device moves through is highly complex, and the conditions it may encounter are varied and changeable. Manually formulated rules cannot flexibly cope with such conditions, so the prior art struggles to produce accurate decisions from them.
Disclosure of Invention
The present specification provides a method and an apparatus for controlling an unmanned device, which partially solve the above problems in the prior art.
The technical solutions adopted by this specification are as follows:
The present specification provides a control method of an unmanned device, comprising:
determining current environmental features according to current motion data of the unmanned device, motion data of surrounding obstacles, and a destination position of the unmanned device, wherein the motion data comprise at least a position and a speed;
inputting the environmental features into the encoder of a pre-trained autoencoder to decouple them, and determining the decoupled features corresponding to the environmental features, wherein the decoupled features characterize the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the unmanned device;
and inputting the decoupled features into a decision model obtained in advance by reinforcement learning, determining the decision corresponding to the environmental features, and controlling the unmanned device to move according to the decision.
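The three steps above can be sketched as a single pipeline. The toy encoder and linear scoring head below are illustrative stand-ins, not the patent's trained networks; the patent only specifies that the encoder's decoupled outputs feed the reinforcement-learned decision model:

```python
import numpy as np

def decide(env_feature, encoder, policy):
    """Hypothetical pipeline: the encoder splits the environmental feature
    into decoupled features; the decision model scores each candidate lane
    and the highest-scoring lane index is the decision."""
    decoupled = encoder(env_feature)      # list of decoupled feature vectors
    z = np.concatenate(decoupled)         # decision-model input
    scores = policy(z)                    # one score per candidate lane
    return int(np.argmax(scores))

# Toy stand-ins for the trained networks (illustrative only).
toy_encoder = lambda x: [x[:2], x[2:]]   # pretend split into two factors
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
toy_policy = lambda z: z @ W             # linear scoring head

decision = decide(np.array([0.2, 0.1, 0.9, 0.3]), toy_encoder, toy_policy)
```

In a real system the encoder and policy would be the trained autoencoder's encoder and the reinforcement-learned decision model respectively.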
Optionally, the autoencoder is trained by the following method:
dividing the environmental data collected while an acquisition device is driving into a plurality of environment segments at a preset time interval;
for each environment segment, determining the environmental features of the segment from its environmental data and taking them as a training sample;
determining, from the environmental features of the segment, the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the acquisition device, as labels of the training sample;
according to the labels of the training samples, exchanging the decoupled features that the encoder of the autoencoder outputs for at least some training samples, and inputting the result into the decoder to obtain exchange-reconstructed features;
determining a loss at least from the environmental features of the training samples and the exchange-reconstructed features, so as to adjust the parameters of the autoencoder.
Optionally, determining the obstacle position distribution of each lane from the environmental features of an environment segment specifically includes:
determining, from the environmental features of the segment, the positions in the lanes adjacent to the acquisition device's lane that are parallel to the acquisition device, and taking the distribution of obstacles at those positions as the parallel-position distribution;
determining, from the environmental features, the gap between the acquisition device and the obstacle ahead of it in each lane, and determining a lane-gap distribution from the gap of each lane;
and determining the obstacle position distribution of each lane from the parallel-position distribution and the lane-gap distribution.
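The position-distribution steps above can be sketched as follows. The occupancy flags and the three-way gap discretisation with bin edges `gap_bins` are assumptions for illustration; the patent does not fix a concrete encoding:

```python
import bisect

def lane_position_features(parallel_occupied, front_gaps, gap_bins=(10.0, 30.0)):
    """Sketch of the per-lane obstacle position distribution: combines
    (i) whether the parallel position in each adjacent lane is occupied and
    (ii) a discretised front gap per lane (0 = close, 1 = mid, 2 = far).
    The bin edges in gap_bins are an assumed discretisation."""
    gap_feature = [bisect.bisect(gap_bins, g) for g in front_gaps]
    return list(parallel_occupied), gap_feature
```

For example, an occupied left parallel position with a 5 m front gap in one lane and a 40 m gap in another would yield `([1, 0], [0, 2])`.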
Optionally, determining the speed distribution of the obstacles in each lane relative to the acquisition device specifically includes:
for each lane, determining a front target obstacle from the obstacles ahead of the acquisition device in that lane, and determining the front speed feature of the lane from the comparison between the speed of the front target obstacle and the speed of the acquisition device;
determining a rear target obstacle from the obstacles behind the acquisition device in that lane, and determining the rear speed feature of the lane from the comparison between the speed of the rear target obstacle and the speed of the acquisition device;
and determining the speed distribution of the obstacles in each lane relative to the acquisition device from the front and rear speed features of each lane.
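A minimal sketch of the speed-comparison relation for one lane follows. The three-way encoding (+1 faster, -1 slower, 0 equal or absent) is an assumption about how the comparison might be discretised:

```python
def lane_speed_features(ego_speed, front_speed, rear_speed):
    """Sketch of the per-lane speed features: compare the front and rear
    target obstacles' speeds against the acquisition device's speed.
    Returns +1 if the obstacle is faster, -1 if slower, 0 if equal or
    absent (None). The discretisation itself is an assumption."""
    def cmp(speed):
        if speed is None:
            return 0
        return (speed > ego_speed) - (speed < ego_speed)
    return cmp(front_speed), cmp(rear_speed)
```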
Optionally, exchanging the decoupled features that the encoder of the autoencoder outputs for at least some training samples according to their labels specifically includes:
for each training sample, finding the training samples whose labels are at least partially identical to its own, so as to build a label association among the training samples;
determining training-sample groups from the label association, and, for each group, taking the partial label shared by all samples in the group as the target label;
and taking the decoupled feature that represents the target label as the target decoupled feature, and exchanging the target decoupled features of the samples within the group.
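The exchange step can be sketched as follows. Pairing consecutive samples within a group is an assumed scheme; the patent only states that the target decoupled features are exchanged among the samples of a group:

```python
def swap_target_features(group, target_idx):
    """Sketch of the exchange step: within a group of training samples that
    share a target label, swap the decoupled feature representing that label
    between consecutive samples. Each sample is a list of decoupled feature
    vectors; target_idx selects the feature to exchange. The pairwise
    pairing is an assumption, not specified by the patent."""
    swapped = [list(sample) for sample in group]
    for i in range(0, len(group) - 1, 2):
        swapped[i][target_idx] = group[i + 1][target_idx]
        swapped[i + 1][target_idx] = group[i][target_idx]
    return swapped
```

Because both samples carry the same target label, a well-trained decoder should reconstruct each sample equally well from the swapped features, which is what the exchange loss later measures.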
Optionally, before exchanging the decoupled features output by the encoder of the autoencoder for at least some training samples, the method further comprises:
for each training sample, inputting its decoupled features into the decoder of the autoencoder and determining the reconstructed features corresponding to that sample.
Optionally, determining the loss at least from the environmental features and exchange-reconstructed features of the training samples specifically includes:
for each training sample, determining a reconstruction loss from the difference between the sample's environmental features and its reconstructed features, the reconstruction loss representing the difference between the input and the output of the autoencoder;
determining an exchange loss from the difference between the sample's environmental features and its exchange-reconstructed features, the exchange loss representing the difference between the encoder's decouplings of the same label;
and determining the total loss from the reconstruction losses and exchange losses of at least some of the training samples.
Optionally, after controlling the unmanned device to move according to the decision, the method further comprises:
re-determining the motion data of the unmanned device and the surrounding obstacles, so as to determine the distance and relative speed between the unmanned device and each obstacle;
and determining the reward corresponding to the decision from the determined distances and relative speeds, and adjusting the parameters of the decision model with reward maximization as the optimization goal.
Optionally, re-determining the motion data of the unmanned device and the surrounding obstacles to determine the distance and relative speed between the unmanned device and each obstacle specifically includes:
judging whether the unmanned device has collided;
if so, determining the penalty corresponding to the decision, stopping the current training round of the decision model, and re-determining environmental features to continue training the decision model;
if not, re-determining the motion data of the unmanned device and the surrounding obstacles to determine the distance and relative speed between the unmanned device and each obstacle.
Optionally, determining the reward corresponding to the decision from the determined distances and relative speeds specifically includes:
re-determining the distance still to be traveled between the unmanned device and the destination position;
determining the time to collision between the unmanned device and each obstacle from the re-determined distances and relative speeds;
determining the reward corresponding to the decision from the times to collision, the speed of the unmanned device, and the distance to be traveled;
wherein the distance to be traveled is negatively correlated with the reward, the time to collision is positively correlated with the reward, the speed of the unmanned device is positively correlated with the reward, and the distance between the unmanned device and each obstacle is positively correlated with the reward.
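A reward with exactly these correlations can be sketched linearly. The linear form and unit weights are assumptions; the patent only fixes the signs of the correlations:

```python
def reward_v1(dist_to_go, min_ttc, ego_speed, min_obstacle_dist,
              weights=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of a reward matching the stated correlations: negative in the
    remaining distance, positive in the minimum time to collision, the
    device's speed, and the clearance to obstacles. Linear combination and
    unit weights are assumptions."""
    w_d, w_t, w_v, w_c = weights
    return (-w_d * dist_to_go + w_t * min_ttc
            + w_v * ego_speed + w_c * min_obstacle_dist)
```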
Optionally, determining the reward corresponding to the decision from the determined distances and relative speeds specifically includes:
determining the steering-wheel angle change rate and the acceleration of the unmanned device while it is controlled according to the decision, and re-determining the distance still to be traveled between the unmanned device and the destination position;
determining the reward corresponding to the decision from the re-determined speed of the unmanned device, the steering-wheel angle change rate, the acceleration, the re-determined distance to be traveled, and the distances and relative speeds between the unmanned device and each obstacle;
wherein the steering-wheel angle change rate is negatively correlated with the reward, the acceleration is negatively correlated with the reward, the re-determined distance to be traveled is negatively correlated with the reward, the re-determined speed of the unmanned device is positively correlated with the reward, the distance between the unmanned device and each obstacle is positively correlated with the reward, and the relative speed between the unmanned device and each obstacle is negatively correlated with the reward.
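The extended reward variant can be sketched the same way. Unit weights, and penalising the magnitudes of the steering rate and acceleration, are assumptions beyond the stated correlations:

```python
def reward_v2(steer_rate, accel, dist_to_go, ego_speed,
              min_obstacle_dist, max_rel_speed):
    """Sketch of the extended reward: steering-wheel angle change rate,
    acceleration, remaining distance, and relative speed to obstacles are
    penalised; the device's speed and the clearance to obstacles are
    rewarded. Unit weights and the use of magnitudes are assumptions."""
    return (-abs(steer_rate) - abs(accel) - dist_to_go
            + ego_speed + min_obstacle_dist - max_rel_speed)
```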
The present specification provides a control apparatus of an unmanned device, including:
an environmental feature determination module, configured to determine current environmental features according to current motion data of the unmanned device, motion data of surrounding obstacles, and a destination position of the unmanned device, the motion data comprising at least a position and a speed;
a decoupling module, configured to input the environmental features into the encoder of a pre-trained autoencoder, decouple them, and determine the decoupled features corresponding to the environmental features, the decoupled features characterizing the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the unmanned device;
and a control module, configured to input the decoupled features into a decision model obtained in advance by reinforcement learning, determine the decision corresponding to the environmental features, and control the unmanned device to move according to the decision.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above control method of an unmanned device.
The present specification provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the above control method of an unmanned device when executing the program.
The technical solutions adopted in this specification can achieve the following beneficial effects:
In the control method of the unmanned device provided by this specification, a pre-trained autoencoder extracts and decouples the environmental features of the environment in which the unmanned device is located to obtain decoupled features; the decoupled features are input into a decision model obtained in advance by reinforcement learning, which outputs the decision corresponding to the environmental features; and the unmanned device is controlled to move to its destination according to the obtained decision.
With this method, interpretable decoupled features can be accurately obtained from the environmental features of a scene via the autoencoder, and the generalizable decision model can accurately output the decision the unmanned device should execute in various scenes, improving the rationality and accuracy of decision-making.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of it, illustrate embodiments of the specification and together with the description serve to explain, not to limit, the specification. In the drawings:
Fig. 1 is a schematic flow chart of a control method of an unmanned device in this specification;
Fig. 2 is a schematic flow chart of a method for training an autoencoder provided herein;
Fig. 3 is a schematic diagram of a label association provided in this specification;
Fig. 4 is a schematic diagram of a feature exchange provided herein;
Fig. 5 is a schematic illustration of a spacing provided herein;
Fig. 6 is a schematic flow chart of a method for training a decision model provided herein;
Fig. 7 is a schematic diagram of a control apparatus of an unmanned device provided herein;
Fig. 8 is a schematic structural diagram of an electronic device provided in this specification.
Detailed Description
In order to make the objects, technical solutions, and advantages of this specification clearer, the technical solutions of this specification are described clearly and completely below with reference to specific embodiments and the accompanying drawings. The described embodiments are only some, not all, of the embodiments of this specification. All other embodiments obtained by a person of ordinary skill in the art, based on the embodiments herein and without creative effort, fall within the protection scope of this specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a control method of an unmanned device in this specification, which specifically includes the following steps:
s100: determining the current environmental characteristics according to the current motion data of the unmanned device, the motion data of surrounding obstacles and the destination position of the unmanned device, wherein the motion data at least comprises position and speed.
In this specification, the control method of the unmanned device may be executed by the unmanned device itself or by a server; the latter case is taken as the example below. When the method is executed by a server, the unmanned device that is the control object of the method transmits its own data, and data about the environment it is in, to the server.
Which decision is ultimately determined, so that the unmanned device can be controlled to reach its destination safely, depends on the motion state of the unmanned device, the distribution of obstacles in the surrounding environment, and the motion states of those obstacles.
Thus, in one or more embodiments of this specification, the server may first determine the motion data of the obstacles around the unmanned device, the current motion data of the unmanned device, and the destination position corresponding to the device's destination. The motion data comprise at least a position and a speed; that is, the motion data of the unmanned device include at least its position and speed, and the motion data of each obstacle include at least that obstacle's position and speed.
The server may then determine the environmental features based on the current motion data of the unmanned device, the motion data of surrounding obstacles, and the destination position of the unmanned device.
Not all obstacles in the environment affect the safety of the unmanned device: the closer an obstacle is to the device, the greater the risk it may pose, and the higher the speeds of the device and of the obstacle, the greater the risk the device may face.
Therefore, in one or more embodiments of this specification, when acquiring obstacle motion data, the server may acquire the motion data only of obstacles within a preset range.
In one or more embodiments of this specification, the preset range may be set as needed; for example, it may be set to 100 m, in which case the server determines the obstacles within 100 m of the unmanned device and their motion data. The preset range may lie ahead of the unmanned device, behind it, or be an area centered on it; this is not limited here.
In one or more embodiments of this specification, since the road along which the unmanned device moves may be divided into multiple lanes, the device may also switch lanes during movement. Therefore, when determining the obstacles within the preset range, the server can determine, from the position of the unmanned device, the obstacles within the preset range ahead of and behind it in its own lane, as well as those within the preset range ahead of and behind it in the adjacent lanes on its left and right, including any obstacle at the position parallel to the unmanned device in those adjacent lanes. For convenience, the lane in which the unmanned device is located is referred to below as the "target lane", and the adjacent lanes on both sides of it as the "adjacent lanes".
The parallel position is the position obtained by translating the unmanned device into an adjacent lane perpendicular to the lane direction. "Adjacent lanes" refers to the lanes on both sides of the target lane; their number is not limited in this specification. For example, the adjacent lanes may be the lane immediately to the left and the lane immediately to the right of the target lane, or the two lanes on each side, as needed.
In one or more embodiments of this specification, the server may specifically determine the obstacle ahead of the unmanned device in its own lane that is closest to it and the obstacle behind it in its own lane that is closest to it, and, in each adjacent lane, the closest obstacle within the preset range ahead of the device, the closest obstacle within the preset range behind it, and the obstacle at the parallel position of that lane.
In the following, the surrounding obstacles are taken, by way of example, to be the obstacles within the preset range ahead of and behind the unmanned device that are closest to it in the target lane and the adjacent lanes. That is, unless otherwise specified, the obstacles mentioned below are those in the target lane and the adjacent lanes that are closest to the unmanned device within the preset range around it.
In one or more embodiments of this specification, when the server determines the current environmental features from the motion data of each obstacle, the current motion data of the unmanned device, and the destination position, it may first determine the distance and relative speed between each obstacle and the unmanned device from their respective motion data. That is, for each obstacle, the relative speed is determined from the speed of the unmanned device and the speed of the obstacle, and the distance is determined from the position of the unmanned device and the position of the obstacle.
After determining the distance and relative speed of each obstacle, the server may determine the time to collision (TTC) between each obstacle and the unmanned device from that distance and relative speed.
Since one purpose of determining the decision is to control the unmanned device to reach its destination safely, the server can also determine the distance between the unmanned device and the destination, as the distance to be traveled, from the position of the device and the position of the destination.
Finally, the server can determine the environmental features at the current moment from the distance between each obstacle and the unmanned device, the relative speed between each obstacle and the unmanned device, the time to collision of each obstacle with the unmanned device, the distance between the unmanned device and the destination, and the motion data of the unmanned device.
Of course, the environmental features of the environment in which the unmanned device is currently located may also be determined from other data; for example, from the acceleration of the unmanned device and the orientation and acceleration of each obstacle. This may be set as needed and is not limited here.
In one or more embodiments of this specification, the determined distances, relative speeds, and times to collision between each obstacle and the unmanned device, the distance between the unmanned device and the destination, and the motion data of the unmanned device may be used directly as the environmental features at the current moment.
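The TTC computation described here can be sketched as the gap divided by the closing speed. Treating a non-closing gap as an unbounded TTC (optionally capped) is an assumption about the edge case, which the text does not specify:

```python
def time_to_collision(gap, closing_speed, ttc_max=float('inf')):
    """Sketch of the time-to-collision estimate: the distance between the
    unmanned device and an obstacle divided by the speed at which that
    distance is shrinking. A non-closing gap (closing_speed <= 0) yields an
    unbounded TTC, capped at ttc_max; this handling is an assumption."""
    if closing_speed <= 0:
        return ttc_max
    return min(gap / closing_speed, ttc_max)
```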
S102: inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment.
In one or more embodiments of the present disclosure, after determining the environmental characteristics, the server may input the environmental characteristics into an encoder in a pre-trained self-encoder, decouple the environmental characteristics, and determine decoupling characteristics corresponding to the environmental characteristics. Wherein each decoupling feature is used for characterizing the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the unmanned equipment.
S104: inputting the decoupled features into a decision model obtained in advance by reinforcement learning, determining the decision corresponding to the environmental features, and controlling the unmanned device to move according to the decision.
In one or more embodiments of this specification, after determining the decoupled features of the environmental features, the server may input them into the decision model obtained in advance by reinforcement learning, determine the decision corresponding to the environmental features, and control the unmanned device to move according to that decision.
In one or more embodiments of this specification, the environmental features correspond to the notion of a state in reinforcement learning, and the resulting decision corresponds to the action performed in that state.
In one or more embodiments of this specification, the finally determined decision may be the lane the unmanned device should follow given the environmental features of its current environment. For example, suppose there are three same-direction lanes a, b, and c, and the unmanned device is currently in lane b (the target lane); if the output decision corresponds to lane a, the server controls the unmanned device to switch to lane a.
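The lane example above can be sketched as a small decision-application helper. The string commands are placeholders for real control actions, and the lane names follow the a/b/c example:

```python
def apply_decision(current_lane, decided_lane, lanes=('a', 'b', 'c')):
    """Sketch of the lane example: the decision names the lane the unmanned
    device should follow; if it differs from the current lane the device
    changes lanes, otherwise it keeps its lane. Returned strings are
    placeholders for real control commands."""
    if decided_lane not in lanes:
        raise ValueError('unknown lane: %r' % (decided_lane,))
    if decided_lane == current_lane:
        return 'keep lane ' + current_lane
    return 'change to lane ' + decided_lane
```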
Based on the control method of the unmanned device shown in fig. 1, the environmental characteristics of the environment where the unmanned device is located can be subjected to feature extraction and decoupling through a pre-trained self-encoder to obtain the decoupling features; the decoupling features are input into a decision model obtained in advance through reinforcement learning, and a decision corresponding to the environmental characteristics is output, so that the unmanned device is controlled to move to the destination according to the obtained decision.
According to this method, interpretable decoupling features can be accurately obtained by the self-encoder from the environmental characteristics corresponding to an environmental scene, and the decision model, which generalizes across scenes, accurately outputs the decision that the unmanned device should be controlled to execute in various environmental scenes, improving the reasonableness and accuracy of decisions.
In one or more embodiments of the present disclosure, the self-encoder is trained using the method shown in fig. 2.
Fig. 2 is a schematic flowchart of a method for training a self-encoder in this specification, which specifically includes the following steps:
s200: and dividing the environmental data acquired in the driving process of the acquisition equipment into a plurality of environmental segments according to a preset time interval.
As can be seen from the implementation process of the above control method for the unmanned aerial vehicle, the method obtains a decision based on a self-encoder and a decision model. Wherein the decision model is used for outputting the decision, and the input of the decision model is the output of an encoder in the self-encoder.
In this specification, the role of the self-encoder is to: and outputting data convenient for learning of the decision model trained in a reinforcement learning mode, namely converting the environmental characteristics determined based on the complex environmental data into a form convenient for analysis and learning of the decision model.
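The encode–decouple–decode interface described above can be sketched with an illustrative linear toy model; the dimensions, the four feature slots, and all names are assumptions for illustration, not the disclosed architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinyAutoEncoder:
    """Toy linear auto-encoder sketch: the encoder maps an environmental
    feature vector to several decoupled feature slots (e.g., spacing /
    front speed / rear speed / parallel position); the decoder reconstructs
    the original input from the concatenated slots."""

    def __init__(self, in_dim: int = 8, slot_dim: int = 2, n_slots: int = 4):
        self.W_enc = rng.normal(size=(in_dim, slot_dim * n_slots))
        self.W_dec = rng.normal(size=(slot_dim * n_slots, in_dim))
        self.slot_dim, self.n_slots = slot_dim, n_slots

    def encode(self, x: np.ndarray) -> list:
        z = x @ self.W_enc
        d = self.slot_dim
        return [z[i * d:(i + 1) * d] for i in range(self.n_slots)]

    def decode(self, slots: list) -> np.ndarray:
        return np.concatenate(slots) @ self.W_dec

ae = TinyAutoEncoder()
slots = ae.encode(np.ones(8))   # four decoupled feature slots
recon = ae.decode(slots)        # reconstructed environmental feature
```

The decision model would consume `slots` rather than the raw environmental feature, which is the point of the decoupling step.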
In one or more embodiments of the present description, the method of training a self-encoder may be performed by a server.
The training samples used to train the self-encoder may be determined from environmental data previously acquired by the acquisition device. The acquisition device may be an unmanned device, a manned vehicle capable of collecting environmental data, or another device, which is not limited herein. When the acquisition device is an unmanned device, the unmanned device serving as the acquisition device and the unmanned device controlled in the control method of the unmanned device may be the same unmanned device or different ones; this specification is not limited herein. The following description takes as an example the case where the acquisition device is the same unmanned device as the one controlled by the control method.
The environmental data, that is, the data in the environment acquired by the acquisition device in the driving process, may specifically include at least: the current movement data of the acquisition equipment, the movement data of surrounding obstacles and the destination position of the acquisition equipment are acquired at each moment in the driving process of the acquisition equipment.
At each moment while the unmanned device moves toward the destination, the number, positions, and speeds of obstacles in the environment and their distances from the unmanned device at least partially change, and the state of the unmanned device itself, such as its speed, position, and distance from the destination, also at least partially changes. The unmanned device and the surrounding obstacles therefore constitute different environmental scenes at different moments of the unmanned device's motion. Different environmental scenes last for different durations, and each moment corresponds to an environmental segment within an environmental scene.
In addition, when the unmanned device moves in the environment, the data which can be directly acquired and influences the decision of the unmanned device is the data of the unmanned device and each obstacle in the current environment. Therefore, in one or more embodiments of the present disclosure, when training the self-encoder, the server may divide the environmental data collected during the driving process of the collecting device into a plurality of environmental segments according to a preset time interval.
In one or more embodiments of the present description, the time interval for dividing the environment segment may be the same as the time interval for dividing the time instants. For example, the time interval may be 1s, 40s, 60s, and so on.
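The segmentation step can be sketched as follows, assuming time-ordered samples collected at a fixed period (the function name and parameters are illustrative):

```python
def split_into_segments(samples: list, segment_seconds: float,
                        sample_period: float) -> list:
    """Divide time-ordered environmental samples into consecutive segments
    covering a preset time interval (segment_seconds)."""
    per_segment = max(1, int(segment_seconds / sample_period))
    return [samples[i:i + per_segment]
            for i in range(0, len(samples), per_segment)]

# 10 samples taken once per second, split into 2-second environment segments.
segments = split_into_segments(list(range(10)), 2.0, 1.0)
print(len(segments))  # prints 5
```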
S202: and determining the environmental characteristics of each environmental segment according to the environmental data corresponding to the environmental segment and taking the environmental characteristics as a training sample.
In one or more embodiments of the present specification, after determining each environment segment, the server may determine, for each environment segment, an environment feature of the environment segment according to environment data corresponding to the environment segment, and use the environment feature of the environment segment as a training sample.
S204: and determining the position distribution of the obstacles of each lane and the speed distribution of the obstacles of each lane and the acquisition equipment according to the environmental characteristics of the environmental section, and taking the obstacle position distribution and the speed distribution of the obstacles and the acquisition equipment of each lane as labels of the training samples.
In one or more embodiments of the present disclosure, after determining the environmental characteristics of the environmental segment, the server may determine the attributes corresponding to the environmental characteristics according to the environmental characteristics. The attribute at least comprises the position distribution of obstacles of each lane and the speed distribution of the obstacles of each lane and the acquisition equipment.
In one or more embodiments of the present disclosure, after determining each environmental segment, the server may determine, according to the environmental characteristics of the environmental segment, obstacle position distribution of each lane and speed distribution of obstacles in each lane and the collecting device as labels of the training samples. That is, the attribute of each training sample is used as a label.
The server can determine the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the acquisition equipment corresponding to the environmental characteristics according to the environmental characteristics of each environmental section, and uses the obtained obstacle position distribution of each lane and the speed distribution of each lane obstacle and the acquisition equipment as the labels of the training samples.
In one or more embodiments of the present disclosure, when determining the obstacle position distribution of each lane according to the environmental characteristics of the environmental section, the server may determine, according to the environmental characteristics of the environmental section, parallel positions on both sides of the collecting device in adjacent lanes of the lane where the collecting device is located, and determine the obstacle distribution in the parallel positions as parallel position distribution, to determine, according to the environmental characteristics, an interval between an obstacle in front of the collecting device in each lane and the collecting device, and determine lane interval distribution according to an interval corresponding to each lane. Then, the server can determine the obstacle position distribution of each lane according to the parallel position distribution and the lane interval distribution.
The lane interval distribution is used for representing the distance (gap) between the obstacle in front of the collecting device and the collecting device in each lane. The parallel position distribution is used for indicating whether obstacles exist at the parallel positions in the adjacent lanes on the two sides of the collecting device. For example, if a parallel position with an obstacle is recorded as 1 and one without an obstacle as 0, then when the left parallel position of the collecting device has no obstacle and the right parallel position has an obstacle, the parallel position distribution may be represented as the binary number 01, or equivalently as the decimal number 1.
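The binary encoding of the parallel position distribution described above can be sketched as (the function name is hypothetical):

```python
def parallel_position_feature(left_occupied: bool, right_occupied: bool) -> int:
    """Encode obstacle presence at the left/right parallel positions as a
    two-bit number (1 = obstacle present, 0 = absent)."""
    return (int(left_occupied) << 1) | int(right_occupied)

# No obstacle at the left parallel position, an obstacle at the right one:
print(parallel_position_feature(False, True))  # binary 01 -> prints 1
```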
In one or more embodiments of the present disclosure, a spacing distribution characteristic may further be determined according to the lane interval distribution, and is used to indicate the lane in which the front obstacle is farthest from the collecting device. The obstacle position distribution of each lane is then determined according to the spacing distribution characteristic and the parallel position distribution. For example, let Gap0, Gap1, and Gap2 denote the gaps between the collecting device and the front obstacle in the left adjacent lane, the target lane, and the right adjacent lane, respectively. The spacing distribution characteristic can be recorded as 0 if Gap0 is the largest, 1 if Gap1 is the largest, and 2 if Gap2 is the largest.
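A sketch of the spacing distribution characteristic, folding in two rules discussed later in this specification: a lane with no front obstacle is treated as having the largest gap, and ties are resolved in favour of the lane where the device is located (all names, and treating a missing obstacle as an infinite gap, are illustrative assumptions):

```python
def spacing_feature(gap_left, gap_ego, gap_right) -> int:
    """Return 0 / 1 / 2 for the lane whose front obstacle is farthest away.

    None means no front obstacle in that lane, treated as an infinite gap;
    if the ego (target) lane ties for the maximum, it wins the tie.
    """
    inf = float("inf")
    gaps = [g if g is not None else inf
            for g in (gap_left, gap_ego, gap_right)]
    best = max(gaps)
    if gaps[1] == best:      # ego lane preferred on ties
        return 1
    return gaps.index(best)

print(spacing_feature(10, 20, 20))  # ego lane ties for the max -> prints 1
```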
In one or more embodiments of the present disclosure, when determining the speed distribution of the obstacles in each lane and the collecting device, the server may determine, for each lane, a front target obstacle according to an obstacle in the lane in front of the collecting device, and determine a front speed characteristic of the lane and the collecting device according to a speed comparison relationship between the speed of the front target obstacle and the speed of the collecting device. And determining a rear target obstacle according to the obstacle behind the acquisition equipment in the lane, and determining the rear speed characteristics of the lane and the acquisition equipment according to the speed comparison relation between the speed of the rear target obstacle and the speed of the acquisition equipment. And then, determining the speed distribution of the obstacles and the acquisition equipment of each lane according to the front speed characteristic and the rear speed characteristic of each lane.
The front speed feature is used for representing the speed relationship between each obstacle in front of the collecting device and the collecting device in each lane, and the rear speed feature is used for representing the speed relationship between each obstacle behind the collecting device and the collecting device in each lane. In the front speed feature, since the obstacle is in front of the collecting device, the collecting device is safer when the obstacle's speed is larger; an obstacle in the corresponding lane whose speed is greater than that of the collecting device is therefore recorded as 1, and one whose speed is smaller is recorded as 0. When there is no obstacle in front in the left adjacent lane, the target lane, or the right adjacent lane, the unmanned device is safe in that lane, so it can also be recorded as 1. A front speed feature represented by the binary number 111 thus indicates that the speed of the front obstacle in each of the left adjacent lane, the target lane, and the right adjacent lane is greater than the speed of the collecting device; the binary number 111 may also be represented as the decimal number 7.
In the rear speed feature, since the obstacle is behind the collecting device, the smaller the obstacle speed, the safer the collecting device is, the speed of the obstacle in the corresponding lane can be recorded as 1 when being smaller than the collecting device, and can be recorded as 0 when being larger than the collecting device. When there is no obstacle behind any of the left adjacent lane, the target lane, and the right adjacent lane, the unmanned vehicle is safe, and therefore, it can be also written as 1.
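The front and rear speed features can be sketched together; the bit order (left adjacent lane, target lane, right adjacent lane) follows the 111-example above, and treating equal speeds as 0 is an assumption not stated in the text:

```python
def speed_feature(obstacle_speeds: list, ego_speed: float,
                  front: bool = True) -> int:
    """Encode per-lane speed comparisons as a 3-bit number.

    obstacle_speeds: speed of the nearest obstacle in the left adjacent,
    target, and right adjacent lane; None means no obstacle (safe -> 1).
    front=True:  a faster front obstacle is safer (bit 1).
    front=False: a slower rear obstacle is safer (bit 1).
    """
    bits = []
    for v in obstacle_speeds:
        if v is None:
            bits.append(1)                        # no obstacle: safe
        elif front:
            bits.append(1 if v > ego_speed else 0)
        else:
            bits.append(1 if v < ego_speed else 0)
    return int("".join(map(str, bits)), 2)

print(speed_feature([10, 10, 10], 5, front=True))  # binary 111 -> prints 7
```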
S206: and exchanging decoupling characteristics of at least part of training samples output by an encoder of the self-encoder according to the label of each training sample, and inputting the decoupling characteristics into a decoder to obtain exchange reconstruction characteristics.
In one or more embodiments of the present disclosure, after determining the label of each training sample, the server may exchange decoupling features of at least a part of the training samples output from the encoder of the encoder according to the label of each training sample, and input the decoupling features into the decoder to obtain an exchange reconstruction feature.
In one or more embodiments of the present specification, when exchanging the decoupling features of at least some training samples output by the encoder of the self-encoder according to the labels of the training samples, the server may, for each training sample, determine the training samples whose labels are at least partially identical to those of the training sample so as to construct a label association relationship between training samples, determine training sample groups according to the label association relationship, and, for each training sample group, take the partial labels shared by the training samples in the group as target labels.
Then, the server may use the decoupling feature representing the target label as a target decoupling feature, and exchange the target decoupling features of the training samples in the training sample group. And aiming at each training sample, determining the exchange characteristics according to each decoupling characteristic after the training sample is exchanged. After the exchange characteristics of the training samples are obtained, the exchange characteristics of the training samples are respectively input into a decoder to determine the exchange reconstruction characteristics corresponding to the training samples.
Fig. 3 is a schematic diagram of a tag association relationship provided in this specification. As shown in the figure, each oblique line filled rectangle represents each training sample, and the training samples connected by the double-headed arrow are the training samples having the association relationship, and it can be seen that the upper rectangle and the lower left rectangle in fig. 3 are connected by the attribute a and the attribute B, that is, the attribute a and the attribute B of both rectangles are the same. The upper rectangle is connected with the lower right rectangle through an attribute D, namely the attribute D of the upper rectangle is the same as that of the lower right rectangle. The left lower rectangle and the right lower rectangle are connected through the attribute B and the attribute C, namely the attribute B and the attribute C of the left lower rectangle and the right lower rectangle are the same.
In one or more embodiments of the present specification, when determining the training sample group, the server may determine, for each training sample, another training sample that is at least partially labeled the same as the training sample according to the label association relationship, as an associated training sample of the training sample, use the same partial label as an associated label, and use an attribute corresponding to the associated label as an associated attribute. Then, the server can determine each training sample group according to the training sample, the determined associated training samples and the preset grouping value. The grouping value is used to indicate the number of training samples obtained from each training sample component, and may be specifically set according to needs, for example, the grouping value may be 2, that is, one training sample group may include 2 training samples, or the grouping value may also be 3 or other training samples, and the present specification is not limited herein.
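The grouping of training samples by shared labels can be sketched as follows, for a grouping value of 2 (pairs); representing each sample's labels as a dict is an assumption for illustration:

```python
from itertools import combinations

def build_groups(labels: list, group_size: int = 2) -> list:
    """Group sample indices whose labels share at least one attribute value.

    labels: one dict (attribute name -> value) per training sample.
    Returns (index tuple, shared labels) pairs; the shared labels act as
    the target labels of that training sample group.
    """
    groups = []
    for combo in combinations(range(len(labels)), group_size):
        shared = set(labels[combo[0]].items())
        for i in combo[1:]:
            shared &= set(labels[i].items())   # keep only common labels
        if shared:
            groups.append((combo, dict(shared)))
    return groups

labels = [{"A": 1, "B": 2}, {"A": 1, "B": 3}, {"A": 9, "B": 3}]
print(build_groups(labels))
# samples 0 and 1 share A=1; samples 1 and 2 share B=3; 0 and 2 share nothing
```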
Taking the grouping value of 2 as an example for explanation, assuming that the associated attribute is lane interval distribution, the server may exchange two training samples in the training sample group with the decoupling feature for representing the lane interval distribution, and after the exchange, splice the decoupling features of the training samples for each training sample to obtain the exchange feature of the training sample.
Taking the grouping value of 3 as an example, it is assumed that the correlation attribute is parallel position distribution, and the training sample group X includes training samples X1, X2, and X3. The server can exchange decoupling features corresponding to parallel position distribution in decoupling features of a training sample X1 to a training sample X2, exchange decoupling features corresponding to parallel position distribution in decoupling features of a training sample X2 to a training sample X3, and exchange decoupling features corresponding to parallel position distribution in decoupling features of a training sample X3 to a training sample X1, so that decoupling feature exchange is achieved. After the exchange, the server may splice the decoupling features of the training samples for each training sample in the training sample group X to obtain the exchange feature of the training sample.
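The cyclic exchange in the example above (X1's feature to X2, X2's to X3, X3's to X1) can be sketched as follows; each sample is kept as a list of decoupled feature slots in a fixed order, so splicing after the exchange is implicit (names are illustrative):

```python
def exchange_slot(samples: list, slot: int) -> list:
    """Cyclically exchange the decoupled feature at position `slot` among
    the samples of one training sample group: sample i receives the slot
    of sample i-1. For a group of two this reduces to a plain swap."""
    moved = [s[slot] for s in samples]
    rotated = moved[-1:] + moved[:-1]
    swapped = []
    for sample, incoming in zip(samples, rotated):
        new_sample = list(sample)
        new_sample[slot] = incoming
        swapped.append(new_sample)
    return swapped

group = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]  # X1, X2, X3
print(exchange_slot(group, 0))
# X2 receives X1's slot-0 feature, X3 receives X2's, X1 receives X3's
```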
In one or more embodiments of the present description, different decoupling features correspond to different attributes, that is, different decoupling features are used for representing different attributes; therefore, when the decoupling features exchanged by the training samples in each training sample group are spliced, the decoupling features can be spliced according to a preset order.
Fig. 4 is a schematic diagram of a feature exchange provided in the present specification. As shown in the figure, the grouping value is 2, rectangles a1, a2, A3 and a4 are decoupling features of a training sample group X and a training sample X1, and rectangles B1, B2, B3 and B4 are decoupling features of a training sample group X and a training sample X2, wherein the rectangles filled with oblique lines represent decoupling features corresponding to lane interval distribution, the rectangles filled with grids represent decoupling features corresponding to front speed features, the rectangles filled with horizontal lines represent decoupling features corresponding to rear speed features, and the rectangles filled with vertical lines represent decoupling features corresponding to parallel position distribution. Because the decoupling feature A1 of the training sample X1 used for representing the lane interval distribution is the same as the decoupling feature B1 of the training sample X2 used for representing the lane interval distribution, the A1 and the B1 are exchanged, and after the exchange, the server splices the decoupling features of the training samples to respectively obtain the exchange features corresponding to the training samples. Where C1 represents the cross-over feature of training sample X1 and C2 represents the cross-over feature of training sample X2. It can be seen that the order of splicing the decoupling features after the training samples are exchanged is the same, and the decoupling features are distributed at intervals of lanes, front speed features, rear speed features and parallel positions.
In one or more embodiments of the present description, the server may exchange multiple identical decoupling features together when multiple identical decoupling features exist between training samples in the same training sample set.
In addition, in one or more embodiments of the present specification, before exchanging the decoupling features of at least part of the training samples output by the encoder of the self-encoder, the server may further input, for each training sample, each decoupling feature of the training sample into the decoder of the self-encoder, and determine a reconstruction feature corresponding to the training sample. So that losses are determined from the reconstructed features in subsequent steps.
S208: determining a loss according to at least the environmental characteristics of the training samples and the exchange reconstruction characteristics to adjust parameters of the self-encoder.
In one or more embodiments of the present disclosure, the server may determine the loss according to at least the environmental characteristics of the training samples and the exchange reconstruction characteristics, and adjust the parameters of the self-encoder with the goal of minimizing the loss.
In one or more embodiments of the present description, in determining the loss, the server may determine the loss for each training sample based on differences between the environmental characteristics of the training sample and the exchange reconstruction characteristics.
In one or more embodiments of the present specification, in determining the loss, the server may further determine, for each training sample, a reconstruction loss according to the difference between the environmental features and the reconstruction features of the training sample, and determine an exchange loss according to the difference between the environmental features and the exchange reconstruction features of the training sample. Then, a total loss is determined based on the reconstruction loss and the exchange loss of at least a portion of the training samples.
Wherein the reconstruction loss characterizes a difference between the self-encoder input and output. I.e. the difference between the reconstructed features characterizing the environmental features input to the self-encoder and the environmental features output from the self-encoder.
In this specification, the self-encoder is trained according to the reconstruction loss so that, after the decoupling features obtained by decoupling the environmental features with the encoder of the self-encoder are input into the decoder, the environmental features can be correctly restored. That is, the goal is to make the reconstructed features output by the decoder of the self-encoder the same as the environmental features input into the encoder. Accurate decoupling ensures accurate reconstruction: the smaller the difference between the reconstructed features and the environmental features, the more accurately the decoupling features obtained by the encoder of the self-encoder represent the environmental features.
The exchange loss characterizes the difference between the results of decoupling the same label by the encoder of the self-encoder. In other words, among all decoupling features of a training sample obtained by the encoder of the self-encoder, the decoupling features corresponding to labels shared with other training samples are exchanged with those training samples; after reconstruction by the decoder, the difference between the environmental features of the training sample and the exchange reconstruction features is obtained. By minimizing the difference between the environmental features input into the self-encoder and the exchange reconstruction features output from it, the difference between the encoder's decoupling results for the same label is also minimized, which is another goal of training the self-encoder in this specification.
Moreover, based on the training of the self-encoder by the exchange loss, decoupling characteristics obtained by decoupling the encoder of the self-encoder from environmental characteristics are mutually independent, and the decoupling characteristics obtained by decoupling the encoder in the self-encoder from the training sample have stable and accurate corresponding relations with the labels of the training sample.
In one or more embodiments of the present description, after determining the total loss, the server may adjust parameters of the self-encoder with the goal of minimizing the total loss.
In one or more embodiments of the present disclosure, in determining the total loss, the number of training samples participating in determining the total loss may be set according to needs, and the present disclosure is not limited herein. For example, a total loss may be determined based on the reconstruction loss and the exchange loss of each training sample in a training sample set, or determined based on the reconstruction loss and the exchange loss of each training sample in a plurality of training samples.
In one or more embodiments of the present disclosure, taking the example of determining the total loss once according to the reconstruction loss and the exchange loss of each training sample in one training sample group, when determining the total loss according to the reconstruction loss and the exchange loss of each training sample, the server may determine, for each training sample group, the reconstruction loss and the exchange loss corresponding to each training sample in the training sample group, and determine the total loss according to each reconstruction loss and each exchange loss corresponding to the training sample group.
In one or more embodiments provided herein, the formula for determining the reconstruction loss may be specified as follows:
Loss1 = ‖S_i − S′_i‖²
where Loss1 denotes the reconstruction loss of the i-th training sample, S_i represents the i-th training sample (i.e., the i-th environmental feature), and S′_i represents the reconstructed feature of the i-th training sample.
In one or more embodiments provided herein, the formula for determining the exchange loss may be specified as follows:
Loss2 = Σ_{i=1}^{n} ‖S_i − S″_i‖²
where Loss2 denotes the exchange loss, S_i represents the i-th training sample in a training sample group (i.e., the i-th environmental feature), S″_i represents the exchange reconstruction feature of the i-th training sample in the group, and n represents the number of training samples in the training sample group.
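Under the two formulas above, a per-group total loss could be computed as follows; weighting the two terms equally is an assumption (in practice, weighting coefficients could be applied):

```python
def sq_err(a: list, b: list) -> float:
    """Squared L2 distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def total_loss(group: list) -> float:
    """group: one (environmental feature, reconstructed feature,
    exchange-reconstructed feature) triple per sample in the group."""
    loss1 = sum(sq_err(s, rec) for s, rec, _ in group)  # reconstruction loss
    loss2 = sum(sq_err(s, ex) for s, _, ex in group)    # exchange loss
    return loss1 + loss2   # equal weighting is an assumption

group = [([1.0, 0.0], [1.0, 0.0], [0.0, 0.0]),
         ([0.0, 1.0], [0.0, 0.0], [0.0, 1.0])]
print(total_loss(group))  # prints 2.0
```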
In addition, in step S204 of this specification, when determining the spacing distribution characteristic, if there are two lanes in which the gap between the front obstacle and the collecting device is equally the largest, the server may determine whether the two lanes include the lane in which the collecting device is located; if so, the server may treat the gap of the lane in which the collecting device is located as the largest and determine the spacing distribution characteristic accordingly.
For example, assume that there are lanes a, b, and c. In an environmental scene A, the acquisition device is in lane b; ahead of it there is an obstacle O1 in lane a, an obstacle O2 in lane b, and an obstacle O3 in lane c, where O1 is 10 m from the acquisition device, O2 is 20 m away, and O3 is also 20 m away. Since the tie includes the lane where the acquisition device is located, the server may determine that the gap to the obstacle in lane b is the largest. When representing the spacing distribution characteristic of environmental scene A, the identifier of lane b may be used: for example, 1 denotes the lane where the acquisition device is located, 0 denotes its left adjacent lane, and 2 denotes its right adjacent lane. The spacing distribution characteristic of environmental scene A can therefore be represented by 1.
In one or more embodiments provided in this specification, when determining each attribute, the server may specifically determine the spacing distribution characteristic of the environmental scene according to the position of the acquisition device and the position of each obstacle contained in the environmental data of the environmental scene, and determine the speed distribution of the environmental scene according to the speed of the acquisition device and the speed of each obstacle contained in the environmental data. The server may also determine the parallel positions in the adjacent lanes of the acquisition device, judge whether an obstacle exists at each parallel position, and determine the parallel position distribution according to the judgment result.
In one or more embodiments of the present specification, the determination result of whether or not an obstacle exists in the parallel position may be represented by 1 and 0, for example, the presence of an obstacle in the parallel position may be represented by 1, and the absence of an obstacle in the parallel position may be represented by 0. (1, 0) indicates that there is an obstacle in the left parallel position and no obstacle in the right parallel position of the unmanned aerial vehicle, and (1, 1) indicates that there is an obstacle in both the left parallel position and the right parallel position of the unmanned aerial vehicle.
In one or more embodiments of the present specification, the parallel position distribution of the environmental scene may be represented by a decimal number: for example, the judgment result (1, 0) may be represented by the decimal number 2, and (1, 1) by the decimal number 3. The server may take the decimal value of the judgment result as the parallel position distribution.
In one or more embodiments of the present disclosure, when there is no adjacent lane on one or both sides of the unmanned device, take the absence of a right adjacent lane as an example: no adjacent lane on the right means the right side is not drivable, which is similar to there being an obstacle on the right. The front speed feature, rear speed feature, and lane interval distribution for a side with no adjacent lane may therefore be determined according to default values. Each default value may be set as needed, and this specification is not limited herein. Likewise, when there is no adjacent lane on one or both sides of the collecting device, the obstacle state of the non-existent adjacent lane may be represented by a default value; for example, if 1 indicates that an obstacle is present, the non-existent adjacent lane may be recorded as 1.
In one or more embodiments of the present disclosure, the server may determine, according to the foremost end of the body of the collecting device, a position line perpendicular to the lane direction where the foremost end of the body of the collecting device is located, and use a distance between an obstacle in each lane and the position line as the distance, i.e., the interval, of the collecting device.
Fig. 5 is a schematic view of gaps provided in the present specification. As shown in the figure, the gray filled rectangle represents the unmanned device serving as the acquisition device, and the white filled rectangles represent obstacles. The horizontal dashed line represents the position line; the obstacle in the left adjacent lane of the unmanned device is at gap G1 from the position line, the obstacle in the right adjacent lane is at gap G2, and G1 is greater than G2. There is no obstacle in the lane where the unmanned device is located. When determining the spacing distribution characteristic, the purpose is to find the lane in which the unmanned device is farthest from the front obstacle, since the larger the gap, the larger the unmanned device's movement space and the safer it is. Therefore, although G1 is greater than G2, since there is no obstacle in the lane where the unmanned device is located, that lane can be determined as the lane with the maximum gap.
In one or more embodiments of the present description, the server may also train the decision model through deep reinforcement learning.
Fig. 6 is a flowchart illustrating a method for training a decision model provided in the present specification. The method flow for training the decision model can comprise the following steps:
S300: determining the current environmental characteristics according to the current motion data of the unmanned device, the motion data of surrounding obstacles and the destination position of the unmanned device, wherein the motion data at least comprises position and speed.
In one or more embodiments of the present description, the decision model may be trained through deep reinforcement learning, based on the trained encoder, by simulating the motion of the unmanned device in the environment. Through simulation, the server can obtain a motion trajectory of the unmanned device during motion, the state (namely, the environmental characteristics) corresponding to each moment of the trajectory, and the action (namely, the executed decision) corresponding to the state at each moment, and calculate the reward corresponding to each action. After the state of the unmanned device is determined, the server can input the state into the encoder to obtain the interpretable decoupling features, and then input the decoupling features into the decision model to obtain the decision, namely the action, corresponding to the state.
In one or more embodiments of the present description, first, the server may determine a current environmental characteristic from current motion data of the drone, the motion data of surrounding obstacles, and a destination location of the drone, the motion data including at least a location and a speed.
S302: inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment.
In one or more embodiments of the present disclosure, after determining the environmental characteristics, the server may input the environmental characteristics into an encoder in a pre-trained self-encoder, decouple the environmental characteristics, and determine decoupling characteristics corresponding to the environmental characteristics, where the decoupling characteristics are used to characterize obstacle position distribution of each lane and speed distribution of obstacles and the unmanned device of each lane.
The training process of the self-encoder may refer to the foregoing steps, and this description is not repeated herein.
S304: and inputting each decoupling characteristic into a decision model to be trained, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
In one or more embodiments of the present disclosure, after obtaining the decoupling features, the server may input the decoupling features into a decision model to be trained, and control the unmanned device according to a decision output by the decision model.
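One decision step of the pipeline above (S302 to S304) can be sketched as follows; `encoder` and `policy` are stand-ins for the trained self-encoder's encoder and the decision model, and their interfaces are assumptions for illustration.

```python
def control_step(env_feature, encoder, policy):
    """State -> interpretable decoupled features -> decision (action)."""
    decoupled = encoder(env_feature)  # decouple the environmental features
    return policy(decoupled)          # decision used to control the device
```

During training, the action returned here is executed in simulation, and the resulting reward drives the parameter update in S308.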
S306: and re-determining the motion data of the unmanned device and surrounding obstacles so as to determine the distance and the relative speed of the unmanned device and the obstacles.
In one or more embodiments of the present description, after controlling the unmanned device according to the decision output by the decision model, the server may re-determine the motion data of the unmanned device and surrounding obstacles to determine the distance and relative speed of the unmanned device from the obstacles.
The purpose of controlling the unmanned device according to the decision is to make the unmanned device travel safely to the destination while avoiding dangers that may occur during travel.
Therefore, in one or more embodiments of the present disclosure, after controlling the unmanned device according to the decision output by the decision model, the server may further determine whether the unmanned device collides, and if so, determine a penalty corresponding to the decision, stop training the decision model, and re-determine the environmental characteristics to train the decision model. If not, the motion data of the unmanned equipment and surrounding obstacles are determined again so as to determine the distance and the relative speed between the unmanned equipment and the obstacles.
S308: determining the reward corresponding to the decision according to the determined distance and relative speed, and adjusting the parameters of the decision model with maximizing the reward as the optimization goal.
In one or more embodiments of the present disclosure, after determining the distance and the relative speed between the unmanned device and each obstacle, the server may determine the reward corresponding to the decision at least according to the determined distance and relative speed, and adjust the parameters of the decision model with maximizing the reward as the optimization goal.
In one or more embodiments of the present disclosure, when determining the reward corresponding to the decision, the server may re-determine the distance to be traveled between the unmanned device and the destination location, and determine the collision time between the unmanned device and each obstacle according to the re-determined distance and relative speed between the unmanned device and each obstacle. Determining the reward corresponding to the decision according to the collision time, the speed of the unmanned equipment and the distance to be traveled;
wherein the distance to be traveled is negatively correlated to the reward since the purpose of the drone is to travel safely to a destination. Since the drone is more dangerous when the time to collision is smaller, the time to collision is positively correlated with the reward. And the speed of the unmanned device is positively correlated with the reward, and the distance between the unmanned device and each obstacle is positively correlated with the reward.
In one or more embodiments of the present disclosure, in order to enable the unmanned aerial vehicle to smoothly perform a decision to safely arrive at a destination without colliding with an obstacle, when determining a reward corresponding to the decision, the server may further determine a steering wheel angle change rate of the unmanned aerial vehicle and an acceleration of the unmanned aerial vehicle when controlling the unmanned aerial vehicle according to the decision, and re-determine a distance to be traveled between the unmanned aerial vehicle and the destination location.
The server may then determine an award corresponding to the decision based on the re-determined speed of the drone, the rate of change of steering wheel angle, the acceleration of the drone, the re-determined distance to be traveled, the distance of the drone from each obstacle, and the relative speed of the drone from each obstacle.
The steering wheel rotation angle change rate is negatively related to the reward, the acceleration is negatively related to the reward, the re-determined distance to be traveled is negatively related to the reward, the re-determined speed of the unmanned device is positively related to the reward, and the distance between the unmanned device and each obstacle is positively related to the reward. Since the distance of the obstacle from the unmanned device is more stable and the probability of collision is less when the relative speed of the obstacle to the unmanned device is smaller, the relative speed of the unmanned device to each obstacle is inversely related to the reward.
Of course, the method for calculating the reward of the decision provided in this specification is only an example; the reward may be determined according to one or a combination of several of the collision time between the unmanned device and each obstacle, the distance to be traveled, the distance between the unmanned device and each obstacle, the relative speed between the unmanned device and each obstacle, the speed of the unmanned device, and the like, or may be calculated by other methods, and this specification is not limited herein.
In this specification, the encoder in the self-encoder converts the state into interpretable decoupling features, which can assist the training of the decision model and improve the training efficiency and the performance of the trained model.
In one or more embodiments of the present disclosure, the reward function corresponding to the decision model may be as follows:
r = r1 + r2 + r3
r2 = -l + v
r3 = -acc - ω
wherein r represents the reward function, r1 represents the safety-related reward, r2 represents the reward related to the efficiency of moving to the destination, and r3 represents the reward related to the smoothness of the movement. l denotes the distance of the unmanned device from the destination, v denotes the speed of the unmanned device after executing the decision, acc denotes the acceleration of the unmanned device after executing the decision, and ω denotes the steering wheel angle change rate when the unmanned device executes the decision.
In one or more embodiments of the present description, after the server controls the unmanned device to execute the decision, if the unmanned device does not collide, r1 = dp + ttc, where dp denotes the distance between the unmanned device and the obstacles and ttc denotes the time to collision; if the unmanned device collides, r1 = -w. w is a positive number and can be set as desired, for example, 100.
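A minimal sketch of this reward follows, assuming the smoothness term penalizes the magnitudes of the acceleration and the steering wheel angle change rate (the exact form of r3 is not fully specified above, so that term is an assumption):

```python
def reward(collided, d_p, ttc, l, v, acc, steer_rate, w=100.0):
    """r = r1 + r2 + r3 as in the specification.

    r1: safety     -- obstacle distance d_p plus time-to-collision ttc,
                      or -w when a collision occurred.
    r2: efficiency -- -l + v (l = remaining distance, v = speed).
    r3: smoothness -- assumed penalty on |acc| and |steer_rate|.
    """
    r1 = -w if collided else d_p + ttc
    r2 = -l + v
    r3 = -abs(acc) - abs(steer_rate)
    return r1 + r2 + r3
```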
In one or more embodiments of the present disclosure, an optimization goal may be determined according to the PPO (Proximal Policy Optimization) algorithm, and the parameters of the decision model may be optimized according to the optimization goal.
In one or more embodiments of the present description, the optimization goal of the decision model may be expressed as:
J = Et[min(rt(θ)At, clip(rt(θ), 1-ε, 1+ε)At)]
wherein J is the optimization objective, θ is the policy parameter of the decision model to be optimized, Et is the expected value at time t, ε is a hyper-parameter that limits the policy update amplitude, and clip is a clipping function used to clip the value of rt(θ) to the range (1-ε, 1+ε). rt(θ) is the ratio of the policy at time t in the current iterative update to the old policy at time t in the last iterative update, namely
rt(θ) = π(at|st) / πold(at|st)
where π(at|st) denotes the policy at the current iterative update and πold(at|st) denotes the old policy at the last iterative update. At represents the advantage function calculated based on the reward function r, and
At = δt + (γλ)δt+1 + … + (γλ)^(T-t-1)δT-1
δt = rt + γV(st+1) - V(st)
wherein γ is a preset first discount factor, λ is a preset second discount factor according to the GAE (Generalized Advantage Estimation) algorithm, δt is the temporal-difference error at time t, T is the total duration of the collected motion trajectory of the unmanned device, and V(st) is the value function at time t.
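The advantage above can be computed with the recursion At = δt + γλ·At+1; the function below is an illustrative implementation, not text from the specification.

```python
def gae_advantages(rewards, values, gamma=0.95, lam=0.95):
    """Generalized Advantage Estimation.

    deltas[t] = r_t + gamma * V(s_{t+1}) - V(s_t)
    A_t       = deltas[t] + (gamma * lam) * A_{t+1}

    `values` must have length len(rewards) + 1: it includes the value
    of the state reached after the last step.
    """
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):  # accumulate backwards over the trajectory
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```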
In one or more embodiments of the present disclosure, ε may have a value of 0.2 and γ may have a value of 0.95.
In one or more embodiments of the present disclosure, a Stochastic Gradient Descent (SGD) method may be used to maximize the optimization target J of the PPO algorithm to update the neural network weights θ, each update being based on N motion trajectories of total duration T (i.e., time step T) collected in parallel, that is, N·T steps of data in total.
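The clipped surrogate objective J can be sketched for a batch of probability ratios and advantages as follows; this is an illustrative implementation only, and a real PPO update would compute the ratios from the policy network and ascend the gradient of this value.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """J = mean over t of min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - eps, min(r, 1.0 + eps))  # clip r_t to [1-eps, 1+eps]
        total += min(r * a, clipped * a)
    return total / len(ratios)
```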
Based on the same idea, the present specification also provides a corresponding control device of the unmanned aerial vehicle, as shown in fig. 7.
Fig. 7 is a schematic diagram of a control device of an unmanned aerial vehicle provided in the present specification, the control device including:
an environment feature determination module 400, configured to determine a current environment feature according to current motion data of an unmanned device, motion data of surrounding obstacles, and a destination location of the unmanned device, where the motion data at least includes a location and a speed;
a decoupling module 401, configured to input the environmental features into an encoder in a pre-trained self-encoder, decouple the environmental features, and determine decoupling features corresponding to the environmental features, where the decoupling features are used to represent obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment;
the control module 402 is configured to input each decoupling characteristic into a decision model obtained by pre-reinforcement learning, determine a decision corresponding to the environmental characteristic, and control the unmanned equipment to move according to the decision;
the device further comprises:
the training module 403 is configured to divide environmental data acquired during a running process of an acquisition device into a plurality of environmental segments according to a preset time interval, determine, for each environmental segment, an environmental characteristic of the environmental segment according to the environmental data corresponding to the environmental segment, and use the environmental characteristic as a training sample, determine, according to the environmental characteristic of the environmental segment, obstacle position distribution of each lane and speed distribution of each lane obstacle and the acquisition device, use the environmental characteristic as a label of the training sample, exchange decoupling characteristics of at least part of training samples output from an encoder of the self-encoder according to the label of each training sample, input the decoupling characteristics into a decoder to obtain exchange reconstruction characteristics, and determine a loss according to at least the environmental characteristic and the exchange reconstruction characteristics of the training samples to adjust parameters of the self-encoder.
Optionally, the training module 403 is further configured to determine parallel positions on two sides of the acquisition device in adjacent lanes of the lane where the acquisition device is located according to the environmental characteristics of the environmental segment, determine obstacle distribution in the parallel positions, as parallel position distribution, determine an interval between an obstacle in front of the acquisition device in each lane and the acquisition device according to the environmental characteristics, determine lane interval distribution according to an interval corresponding to each lane, and determine obstacle position distribution of each lane according to the parallel position distribution and the lane interval distribution.
Optionally, the training module 403 is further configured to, for each lane, determine a front target obstacle according to an obstacle in front of the collecting device in the lane, determine front speed features of the lane and the collecting device according to a speed comparison relationship between the speed of the front target obstacle and the speed of the collecting device, determine a rear target obstacle according to an obstacle behind the collecting device in the lane, determine rear speed features of the lane and the collecting device according to a speed comparison relationship between the speed of the rear target obstacle and the speed of the collecting device, and determine speed distributions of the obstacles in the lanes and the collecting device according to the front speed features and the rear speed features of the lanes.
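The front and rear speed features described above can be sketched with a simple three-valued encoding; the encoding values (1 / -1 / 0) are an assumption for illustration, since the specification only requires a speed comparison relationship and a default value for absent obstacles.

```python
def speed_features(front_speed, rear_speed, ego_speed):
    """Front and rear speed features for one lane: 1 if the target
    obstacle is faster than the acquisition device, -1 if not faster,
    0 (a default value) if the obstacle is absent."""
    def compare(obstacle_speed):
        if obstacle_speed is None:  # no target obstacle in this lane
            return 0
        return 1 if obstacle_speed > ego_speed else -1
    return compare(front_speed), compare(rear_speed)
```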
Optionally, the training module 403 is further configured to determine, for each training sample, each training sample that is at least partially identical to a label of the training sample, so as to construct a label association relationship between the training samples, determine each training sample group according to the label association relationship, determine, for each training sample group, a partial label that is identical to each training sample in the training sample group, use the partial label as a target label, use a decoupling feature that characterizes the target label as a target decoupling feature, and exchange the target decoupling features of each training sample in the training sample group.
Optionally, the training module 403 is further configured to, for each training sample, input each decoupling feature of the training sample into a decoder of the self-encoder, and determine a reconstruction feature corresponding to the training sample.
Optionally, the training module 403 is further configured to, for each training sample, determine a reconstruction loss according to the difference between the environmental characteristic of the training sample and its reconstruction characteristic, where the reconstruction loss characterizes the difference between the input and the output of the self-encoder; determine an exchange loss according to the difference between the environmental characteristic of the training sample and its exchange reconstruction characteristic, where the exchange loss characterizes the difference between the decoupling results of the encoder of the self-encoder for the same label; and determine a total loss according to the reconstruction losses and exchange losses of at least part of the training samples.
The apparatus further comprises: an adjusting module 404, configured to re-determine the motion data of the unmanned device and surrounding obstacles to determine the distance and relative speed between the unmanned device and each obstacle, determine the reward corresponding to the decision according to the determined distance and relative speed, and adjust the parameters of the decision model with maximizing the reward as the optimization goal.
The adjusting module 404 is further configured to determine whether the unmanned device collides, determine a penalty corresponding to the decision if the unmanned device collides, stop a current training process of the decision model, and re-determine environmental characteristics to continue training the decision model, and if the unmanned device does not collide, re-determine motion data of the unmanned device and surrounding obstacles to determine distances and relative speeds between the unmanned device and the obstacles.
Optionally, the adjusting module 404 is further configured to redetermine a distance to be traveled between the unmanned aerial vehicle and the destination location, determine a collision time between the unmanned aerial vehicle and each obstacle according to the redetermined distance and relative speed between the unmanned aerial vehicle and each obstacle, and determine a reward corresponding to the decision according to the collision time, the speed of the unmanned aerial vehicle, and the distance to be traveled, wherein the distance to be traveled is negatively correlated with the reward, the collision time is positively correlated with the reward, the speed of the unmanned aerial vehicle is positively correlated with the reward, and the distance between the unmanned aerial vehicle and each obstacle is positively correlated with the reward.
Optionally, the adjusting module 404 is further configured to determine the steering wheel angle change rate of the unmanned device and the acceleration of the unmanned device when the unmanned device is controlled according to the decision, and re-determine the distance to be traveled between the unmanned device and the destination location; and determine the reward corresponding to the decision according to the re-determined speed of the unmanned device, the steering wheel angle change rate, the acceleration, the re-determined distance to be traveled, the distance between the unmanned device and each obstacle, and the relative speed between the unmanned device and each obstacle, wherein the steering wheel angle change rate is negatively correlated with the reward, the acceleration is negatively correlated with the reward, the re-determined distance to be traveled is negatively correlated with the reward, the re-determined speed of the unmanned device is positively correlated with the reward, the distance between the unmanned device and each obstacle is positively correlated with the reward, and the relative speed between the unmanned device and each obstacle is negatively correlated with the reward.
The present specification also provides a computer-readable storage medium storing a computer program that is operable to execute the above-described control method of the unmanned aerial vehicle.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 8. As shown in fig. 8, at the hardware level, the electronic device includes a processor, an internal bus, a memory, and a non-volatile memory, but may also include hardware required for other services. The processor reads a corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the control method of the unmanned equipment.
Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, an improvement in a technology could be clearly distinguished as an improvement in hardware (e.g., improvements in circuit structures such as diodes, transistors, and switches) or an improvement in software (an improvement in method flow). However, as technology advances, many of today's method-flow improvements can be regarded as direct improvements in hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Thus, it cannot be said that an improvement in a method flow cannot be realized by hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a PLD by programming it personally, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually fabricating integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that hardware circuitry that implements the logical method flows can be readily obtained by merely slightly programming the method flows into an integrated circuit using the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application-Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented by logically programming the method steps such that the controller takes the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. Or even the means for performing the functions may be regarded both as software modules for performing the method and as structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (14)

1. A control method of unmanned equipment, characterized by comprising:
determining current environmental characteristics according to current motion data of the unmanned equipment, motion data of surrounding obstacles and a destination position of the unmanned equipment, wherein the motion data at least comprises a position and a speed;
inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment;
and inputting each decoupling characteristic into a decision model obtained by pre-reinforcement learning, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
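The control flow recited in claim 1 can be sketched end to end. Everything below is a toy illustration, not the patented implementation: the linear "encoder", the feature sizes, the action set, and the random weights standing in for trained models are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W_enc = rng.standard_normal((8, 16))   # stand-in for the trained encoder

def encode(env_features):
    """Map an environmental feature vector to decoupled factors:
    first half ~ per-lane obstacle position distribution,
    second half ~ per-lane speed distribution (illustrative split)."""
    z = np.tanh(W_enc @ env_features)
    return z[:4], z[4:]

ACTIONS = ["keep_lane", "change_left", "change_right"]  # assumed action set
W_pi = rng.standard_normal((3, 8))     # stand-in for the RL decision model

def decide(env_features):
    # decoupled factors, not the raw features, feed the decision model
    pos_f, spd_f = encode(env_features)
    logits = W_pi @ np.concatenate([pos_f, spd_f])
    return ACTIONS[int(np.argmax(logits))]

# e.g. ego pose/speed, obstacle motion data, destination, flattened
features = rng.standard_normal(16)
action = decide(features)
```

The point of the split in `encode` is that the downstream policy sees factored, interpretable inputs rather than one entangled vector.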
2. The method of claim 1, wherein the self-encoder is trained by the following steps:
dividing environmental data acquired in the running process of acquisition equipment into a plurality of environmental segments according to a preset time interval;
determining the environmental characteristics of each environmental segment according to the environmental data corresponding to the environmental segment and taking the environmental characteristics as a training sample;
according to the environmental characteristics of the environmental section, determining the position distribution of obstacles of each lane and the speed distribution of the obstacles of each lane and the acquisition equipment as labels of the training samples;
exchanging decoupling characteristics of at least part of training samples output by an encoder of the self-encoder according to the labels of the training samples, and inputting the decoupling characteristics into a decoder to obtain exchange reconstruction characteristics;
determining a loss according to at least the environmental characteristics of the training samples and the exchange reconstruction characteristics to adjust parameters of the self-encoder.
3. The method of claim 2, wherein determining the obstacle position distribution of each lane according to the environmental characteristics of the environmental segment comprises:
according to the environmental characteristics of the environmental section, determining the parallel positions of the two sides of the acquisition equipment in the adjacent lanes of the lane where the acquisition equipment is located, and determining the obstacle distribution in the parallel positions as the parallel position distribution;
determining the interval between an obstacle in front of the acquisition equipment and the acquisition equipment in each lane according to the environmental characteristics, and determining lane interval distribution according to the interval corresponding to each lane;
and determining the position distribution of the obstacles of each lane according to the parallel position distribution and the lane interval distribution.
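The two distributions recited in claim 3 can be illustrated numerically. The following is a hedged sketch under assumed conventions: obstacles are (lane, longitudinal position) pairs, the acquisition vehicle sits in lane 1, and the side-window half-length and sensing horizon are invented constants.

```python
# Lane 1 is the ego (acquisition-equipment) lane; lanes 0 and 2 are adjacent.
EGO_LANE, EGO_POS, EGO_HALF_LEN = 1, 50.0, 2.5
obstacles = [(0, 49.0), (0, 80.0), (1, 65.0), (2, 51.5), (2, 120.0)]

def parallel_occupancy(obs):
    """1 if an adjacent-lane obstacle overlaps the ego's side window
    (the 'parallel position distribution' of claim 3)."""
    occ = {lane: 0 for lane in (EGO_LANE - 1, EGO_LANE + 1)}
    for lane, pos in obs:
        if lane in occ and abs(pos - EGO_POS) <= EGO_HALF_LEN:
            occ[lane] = 1
    return occ

def lane_gaps(obs, num_lanes=3, horizon=200.0):
    """Gap to the nearest obstacle ahead of the ego in every lane
    (the 'lane interval distribution' of claim 3)."""
    gaps = [horizon] * num_lanes
    for lane, pos in obs:
        if pos > EGO_POS:
            gaps[lane] = min(gaps[lane], pos - EGO_POS)
    return gaps
```

Concatenating the two outputs gives one possible per-lane obstacle position distribution.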
4. The method of claim 2, wherein determining the speed distribution of the obstacles of each lane and the acquisition equipment comprises:
for each lane, determining a front target obstacle according to an obstacle in front of the acquisition equipment in the lane, and determining front speed characteristics of the lane and the acquisition equipment according to a speed comparison relation between the speed of the front target obstacle and the speed of the acquisition equipment;
determining a rear target obstacle according to an obstacle behind the acquisition equipment in the lane, and determining rear speed characteristics of the lane and the acquisition equipment according to a speed comparison relation between the speed of the rear target obstacle and the speed of the acquisition equipment;
and determining the speed distribution of the obstacles and the acquisition equipment of each lane according to the front speed characteristic and the rear speed characteristic of each lane.
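A minimal sketch of claim 4's speed features follows; the three-valued encoding (+1 faster than ego, -1 slower, 0 absent) and the nearest-obstacle selection rule are assumptions, since the claim only states that front and rear target-obstacle speeds are compared with the acquisition equipment's speed.

```python
EGO_POS, EGO_SPEED = 50.0, 10.0
# obstacles as (lane, longitudinal position, speed)
obstacles = [(0, 80.0, 8.0), (0, 30.0, 12.0), (1, 65.0, 11.0)]

def nearest(group):
    # assumed rule: the target obstacle is the one closest to the ego
    return min(group, key=lambda o: abs(o[1] - EGO_POS))

def speed_features(obs, num_lanes=2):
    """Per lane: (front speed feature, rear speed feature)."""
    feats = []
    for lane in range(num_lanes):
        front = [o for o in obs if o[0] == lane and o[1] > EGO_POS]
        rear = [o for o in obs if o[0] == lane and o[1] <= EGO_POS]
        def sign(group):
            if not group:
                return 0  # no obstacle in this direction
            return 1 if nearest(group)[2] > EGO_SPEED else -1
        feats.append((sign(front), sign(rear)))
    return feats

feats = speed_features(obstacles)
```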
5. The method of claim 2, wherein exchanging the decoupling characteristics of at least a portion of the training samples output from the encoder of the self-encoder according to the labels of the training samples comprises:
determining, for each training sample, the training samples whose labels are at least partially identical to the label of that training sample, so as to construct label association relations among the training samples;
determining each training sample group according to the label association relations, and determining, for each training sample group, the identical partial label shared by the training samples in the group as a target label;
and taking the decoupling characteristic representing the target label as a target decoupling characteristic, and exchanging the target decoupling characteristic of each training sample in the training sample group.
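The label-guided exchange of claim 5 can be shown on two toy samples. The label components ("pos", "spd") and the one-dimensional features are illustrative assumptions: samples that share a label component swap the decoupling feature representing that component.

```python
# Two training samples sharing the "pos" label component but not "spd".
samples = [
    {"label": {"pos": "A", "spd": "X"}, "z": {"pos": [1.0], "spd": [2.0]}},
    {"label": {"pos": "A", "spd": "Y"}, "z": {"pos": [3.0], "spd": [4.0]}},
]

def swap_shared(s1, s2):
    """Swap the decoupling features of every shared (target) label component."""
    shared = [k for k in s1["label"] if s1["label"][k] == s2["label"][k]]
    for k in shared:
        s1["z"][k], s2["z"][k] = s2["z"][k], s1["z"][k]
    return shared

target = swap_shared(samples[0], samples[1])
```

If the encoder has truly decoupled the factors, swapping features that encode the same label should not change what the decoder can reconstruct, which is what the exchange loss of claim 7 measures.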
6. The method of claim 5, wherein prior to exchanging the decoupling characteristics of at least a portion of the training samples output by the encoder of the self-encoder, the method further comprises:
and inputting each decoupling characteristic of each training sample into a decoder of the self-encoder aiming at each training sample, and determining a reconstruction characteristic corresponding to the training sample.
7. The method of claim 6, wherein determining the loss based at least on the environmental characteristics of the training samples and the exchange reconstruction characteristics comprises:
for each training sample, determining a reconstruction loss according to the difference between the environmental characteristics of the training sample and the reconstruction characteristics, wherein the reconstruction loss represents the difference between the input and the output of the self-encoder;
determining exchange loss according to the difference between the environmental characteristics and the exchange reconstruction characteristics of the training sample, wherein the exchange loss represents the difference between the results of decoupling the same label by an encoder of the self-encoder;
the total loss is determined based on the reconstruction loss and the exchange loss of at least a portion of the training samples.
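The two losses of claims 6 and 7 can be sketched as follows; the squared-error form and the unweighted sum are assumptions, since the claims only state which quantities each loss compares.

```python
import numpy as np

def reconstruction_loss(x, x_rec):
    """Input vs. plain autoencoder output (claim 6's reconstruction)."""
    return float(np.mean((np.asarray(x) - np.asarray(x_rec)) ** 2))

def exchange_loss(x, x_swap_rec):
    """Input vs. reconstruction after swapping same-label decoupled
    features between samples: small only if the encoder decouples well."""
    return float(np.mean((np.asarray(x) - np.asarray(x_swap_rec)) ** 2))

def total_loss(batch):
    # batch entries: (environmental features, reconstruction, swap reconstruction)
    return sum(reconstruction_loss(x, r) + exchange_loss(x, s)
               for x, r, s in batch)

batch = [([1.0, 2.0], [1.0, 2.5], [1.5, 2.0])]
loss = total_loss(batch)
```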
8. The method of claim 1, wherein after controlling the unmanned device to move in accordance with the decision, the method further comprises:
re-determining the motion data of the unmanned equipment and surrounding obstacles so as to determine the distance and the relative speed of the unmanned equipment and the obstacles;
and determining the reward corresponding to the decision according to the determined distance and the relative speed, and adjusting the parameters of the decision model according to the maximum reward optimization goal.
9. The method of claim 8, wherein re-determining the motion data of the unmanned equipment and surrounding obstacles to determine the distance and relative speed between the unmanned equipment and the obstacles comprises:
judging whether the unmanned equipment is collided or not;
if so, determining a penalty corresponding to the decision, stopping the current training process of the decision model, and re-determining environmental characteristics to train the decision model continuously;
if not, the motion data of the unmanned equipment and surrounding obstacles are determined again so as to determine the distance and the relative speed between the unmanned equipment and the obstacles.
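Claim 9's branch, where a collision ends the training episode with a penalty and otherwise fresh distances and relative speeds are computed, can be sketched like this; the penalty value and the data layout are invented for illustration.

```python
COLLISION_PENALTY = -100.0  # assumed magnitude, not from the claims

def step_outcome(collided, ego, obstacles):
    """Collision: return a penalty and signal an episode reset.
    Otherwise: re-determine distances and relative speeds to each obstacle."""
    if collided:
        return {"penalty": COLLISION_PENALTY, "reset": True}
    dists = [abs(o["pos"] - ego["pos"]) for o in obstacles]
    rel_speeds = [o["speed"] - ego["speed"] for o in obstacles]
    return {"distances": dists, "rel_speeds": rel_speeds, "reset": False}

out = step_outcome(False, {"pos": 0.0, "speed": 10.0},
                   [{"pos": 12.0, "speed": 8.0}])
crash = step_outcome(True, {"pos": 0.0, "speed": 10.0}, [])
```

On a reset, a new environmental characteristic would be determined and training of the decision model would continue, as the claim recites.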
10. The method of claim 8, wherein determining the reward corresponding to the decision based on the determined distance and relative speed comprises:
re-determining a distance to be traveled between the unmanned device and the destination location;
determining collision time of the unmanned equipment and each obstacle according to the re-determined distance and relative speed of the unmanned equipment and each obstacle;
determining rewards corresponding to the decisions according to the collision time, the speed of the unmanned equipment and the distance to be traveled;
wherein the distance to be traveled is negatively correlated with the reward, the time to collision is positively correlated with the reward, the speed of the unmanned device is positively correlated with the reward, and the distance between the unmanned device and each obstacle is positively correlated with the reward.
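The correlations recited in claim 10 admit a simple linear reward as one possible instantiation; the weights and the linear form are assumptions, and only the signs follow the claim.

```python
def reward(dist_to_go, time_to_collision, speed, min_obstacle_dist,
           w=(0.1, 1.0, 0.5, 0.2)):
    """Illustrative reward with the sign structure of claim 10."""
    w_d, w_t, w_v, w_o = w
    return (-w_d * dist_to_go          # farther from destination -> lower
            + w_t * time_to_collision  # more time before collision -> higher
            + w_v * speed              # faster progress -> higher
            + w_o * min_obstacle_dist) # more obstacle clearance -> higher

r_near = reward(10.0, 5.0, 8.0, 3.0)
r_far = reward(50.0, 5.0, 8.0, 3.0)   # only the distance-to-go differs
```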
11. The method of claim 8, wherein determining the reward corresponding to the decision based on the determined distance and relative speed comprises:
determining a steering wheel angle change rate of the unmanned equipment and an acceleration of the unmanned equipment when the unmanned equipment is controlled according to the decision, and re-determining a distance to be traveled between the unmanned equipment and the destination position;
determining rewards corresponding to the decisions according to the redetermined speed of the unmanned equipment, the steering wheel rotation angle change rate, the acceleration, the redetermined distance to be traveled, the distance between the unmanned equipment and each obstacle and the relative speed between the unmanned equipment and each obstacle;
wherein the steering wheel angle change rate is negatively correlated with the reward, the acceleration is negatively correlated with the reward, the re-determined distance to be traveled is negatively correlated with the reward, the re-determined speed of the unmanned equipment is positively correlated with the reward, the distance between the unmanned equipment and each obstacle is positively correlated with the reward, and the relative speed between the unmanned equipment and each obstacle is negatively correlated with the reward.
12. A control apparatus of unmanned equipment, characterized by comprising:
the environment characteristic determining module is used for determining current environment characteristics according to current motion data of the unmanned equipment, motion data of surrounding obstacles and a destination position of the unmanned equipment, wherein the motion data at least comprises a position and a speed;
the decoupling module is used for inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing the position distribution of obstacles of each lane and the speed distribution of the obstacles of each lane and the unmanned equipment;
and the control module is used for inputting each decoupling characteristic into a decision model obtained by pre-reinforcement learning, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
13. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 11.
14. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 11 when executing the program.
CN202111315547.7A 2021-11-08 2021-11-08 Control method and device of unmanned equipment Active CN114167857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111315547.7A CN114167857B (en) 2021-11-08 2021-11-08 Control method and device of unmanned equipment

Publications (2)

Publication Number Publication Date
CN114167857A (en) 2022-03-11
CN114167857B (en) 2023-12-22

Family

ID=80478594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111315547.7A Active CN114167857B (en) 2021-11-08 2021-11-08 Control method and device of unmanned equipment

Country Status (1)

Country Link
CN (1) CN114167857B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439957A (en) * 2022-09-14 2022-12-06 上汽大众汽车有限公司 Intelligent driving data acquisition method, acquisition device, acquisition equipment and computer readable storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007316799A (en) * 2006-05-24 2007-12-06 Tottori Univ Autonomous mobile robot having learning function
CN101804625A (en) * 2009-02-18 2010-08-18 索尼公司 Robot device and control method thereof and computer program
CN102393744A (en) * 2011-11-22 2012-03-28 湖南大学 Navigation method of pilotless automobile
CN103085816A (en) * 2013-01-30 2013-05-08 同济大学 Trajectory tracking control method and control device for driverless vehicle
CN103207090A (en) * 2013-04-09 2013-07-17 北京理工大学 Driverless vehicle environment simulation test system and test method
US20150105906A1 (en) * 2012-04-26 2015-04-16 Taishi Ueda Autonomous mobile device, autonomous movement system, and autonomous movement method
CN111123927A (en) * 2019-12-20 2020-05-08 北京三快在线科技有限公司 Trajectory planning method and device, automatic driving equipment and storage medium



Similar Documents

Publication Publication Date Title
US20200262448A1 (en) Decision method, device, equipment in a lane changing process and storage medium
CN112099496B (en) Automatic driving training method, device, equipment and medium
CN111208838B (en) Control method and device of unmanned equipment
JP2022516383A (en) Autonomous vehicle planning
CN113805572A (en) Method and device for planning movement
CN111238523B (en) Method and device for predicting motion trail
CN113341941B (en) Control method and device of unmanned equipment
CN111338360B (en) Method and device for planning vehicle driving state
CN113074748B (en) Path planning method and device for unmanned equipment
Aradi et al. Policy gradient based reinforcement learning approach for autonomous highway driving
CN113110526B (en) Model training method, unmanned equipment control method and device
CN111186437B (en) Vehicle track risk determination method and device
CN112947495B (en) Model training method, unmanned equipment control method and device
JP7520444B2 (en) Vehicle-based data processing method, data processing device, computer device, and computer program
CN110942181A (en) Method and device for predicting obstacle track
CN111126362A (en) Method and device for predicting obstacle track
CN113296541A (en) Future collision risk based unmanned equipment control method and device
CN112949756B (en) Method and device for model training and trajectory planning
CN116295473A (en) Unmanned vehicle path planning method, unmanned vehicle path planning device, unmanned vehicle path planning equipment and storage medium
CN114167857A (en) Control method and device of unmanned equipment
CN113033527A (en) Scene recognition method and device, storage medium and unmanned equipment
CN110895406B (en) Method and device for testing unmanned equipment based on interferent track planning
CN112644487A (en) Automatic driving method and device
CN114815825B (en) Method and device for determining optimal running track of vehicle
CN114019971B (en) Unmanned equipment control method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant