CN114167857B - Control method and device of unmanned equipment

Info

Publication number: CN114167857B (application CN202111315547.7A)
Authority: CN (China)
Prior art keywords: obstacle, determining, unmanned, lane, environmental
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN114167857A
Inventors: 熊方舟, 夏华夏, 任冬淳, 丁曙光, 樊明宇
Current and original assignee: Beijing Sankuai Online Technology Co Ltd
Application filed by Beijing Sankuai Online Technology Co Ltd; priority to CN202111315547.7A; published as CN114167857A and, upon grant, CN114167857B

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The specification discloses a control method and device for unmanned equipment, applicable to the technical field of unmanned driving. Environmental characteristics of the environment in which the unmanned equipment is located can be feature-extracted and decoupled by a pre-trained self-encoder (i.e., an autoencoder) to obtain decoupling characteristics; the decoupling characteristics are input into a decision model obtained in advance by reinforcement learning, and a decision corresponding to the environmental characteristics is output, so that the unmanned equipment is controlled to move toward a destination according to the obtained decision. Interpretable decoupling characteristics can be accurately obtained from the self-encoder according to the environmental characteristics corresponding to each environmental scene, and the decisions executed by the unmanned equipment under various environmental scenes can be accurately output by the generalizable decision model, thereby improving the rationality and accuracy of the decisions.

Description

Control method and device of unmanned equipment
Technical Field
The present disclosure relates to the field of unmanned technologies, and in particular, to a method and an apparatus for controlling an unmanned device.
Background
At present, in the technical field of unmanned driving, decision making is key to controlling the safe movement of unmanned equipment in an environment, and the accuracy of the decisions affects the safety of the unmanned equipment.
In the prior art, environmental data is collected by the unmanned device, and decisions are determined according to the collected environmental data and manually formulated rules. These rules cover the various environmental conditions the unmanned device may encounter during movement and the corresponding decisions.
However, the environment in which the unmanned device moves is extremely complex, the environmental conditions it may encounter are varied and changeable, and manually formulated rules can hardly cope with them flexibly. The prior art therefore has the problem that accurate decisions are difficult to obtain based on manually formulated rules.
Disclosure of Invention
The present disclosure provides a method and apparatus for controlling an unmanned device, so as to partially solve the above-mentioned problems in the prior art.
The technical scheme adopted in the specification is as follows:
the present specification provides a control method of an unmanned apparatus, comprising:
determining current environmental characteristics according to current motion data of the unmanned equipment, motion data of each surrounding obstacle and the destination position of the unmanned equipment, wherein the motion data at least comprise position and speed;
inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment;
And inputting each decoupling characteristic into a decision model obtained by reinforcement learning in advance, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
Optionally, the self-encoder is trained by the following method:
dividing environmental data acquired in the running process of the acquisition equipment into a plurality of environmental segments according to a preset time interval;
for each environmental segment, determining the environmental characteristics of the environmental segment according to the environmental data corresponding to the environmental segment, and taking the environmental characteristics as a training sample;
according to the environmental characteristics of the environmental segment, determining the obstacle position distribution of each lane and the speed distribution of each lane's obstacles and of the acquisition equipment, and taking these distributions as the labels of the training sample;
exchanging decoupling characteristics of at least part of training samples output by an encoder of the self-encoder according to labels of all training samples, and inputting the decoupling characteristics into a decoder to obtain exchanged reconstruction characteristics;
a loss is determined based at least on the environmental characteristics of the training samples and the exchange reconstruction characteristics to adjust the parameters of the self-encoder.
Optionally, determining the obstacle position distribution of each lane according to the environmental characteristics of the environmental segment specifically includes:
According to the environmental characteristics of the environmental segment, determining the parallel positions on both sides of the acquisition equipment in the lanes adjacent to the lane where the acquisition equipment is located, and determining the obstacle distribution at the parallel positions as the parallel position distribution;
according to the environmental characteristics, determining the interval between the barrier in front of the acquisition equipment and the acquisition equipment in each lane, and determining lane interval distribution according to the interval corresponding to each lane;
and determining the obstacle position distribution of each lane according to the parallel position distribution and the lane interval distribution.
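As an illustrative sketch only (the function name, lane keys, and gap thresholds below are assumptions rather than details from this specification), the combination of parallel-position distribution and lane-gap distribution into a per-lane obstacle position distribution might be encoded as:

```python
def obstacle_position_distribution(parallel_occupied, front_gaps, gap_bins=(10.0, 30.0)):
    """parallel_occupied: {lane: bool}, whether an obstacle sits at the parallel
    position in that adjacent lane; front_gaps: {lane: float}, distance (m) to
    the nearest obstacle ahead in that lane.
    Returns {lane: (occupied, gap_bucket)}."""
    def discretize(gap):
        # bucket the forward gap into near (0) / mid (1) / far (2)
        for i, edge in enumerate(gap_bins):
            if gap < edge:
                return i
        return len(gap_bins)
    return {lane: (parallel_occupied.get(lane, False), discretize(gap))
            for lane, gap in front_gaps.items()}
```

A lane with an obstacle alongside and a small forward gap thus yields a distinctly different code from a free lane, which is what makes the distribution usable as a training label.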
Optionally, determining a speed distribution of each lane obstacle and the acquisition device specifically includes:
for each lane, determining a front target obstacle according to an obstacle in front of the acquisition equipment in the lane, and determining a front speed characteristic of the lane and the acquisition equipment according to a speed comparison relation of the front target obstacle and the acquisition equipment;
determining a rear target obstacle according to an obstacle behind the acquisition equipment in the lane, and determining rear speed characteristics of the lane and the acquisition equipment according to a speed comparison relation between the rear target obstacle and the acquisition equipment;
And determining the speed distribution of the obstacles of each lane and the acquisition equipment according to the front speed characteristics and the rear speed characteristics of each lane.
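A minimal sketch of the front/rear speed comparison just described (encoding faster/slower/absent as +1/-1/0 is an assumption for illustration, not a detail fixed by this specification):

```python
def speed_distribution(lanes, ego_speed):
    """lanes: {lane: {"front": speed or None, "rear": speed or None}} for the
    target obstacles ahead of and behind the acquisition equipment in each lane.
    Encodes each lane as (front_cmp, rear_cmp): +1 if the obstacle is faster
    than the acquisition equipment, -1 if slower, 0 if equal or absent."""
    def cmp_speed(obstacle_speed):
        if obstacle_speed is None:
            return 0
        return (obstacle_speed > ego_speed) - (obstacle_speed < ego_speed)
    return {lane: (cmp_speed(s.get("front")), cmp_speed(s.get("rear")))
            for lane, s in lanes.items()}
```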
Optionally, exchanging the decoupling characteristics of at least part of the training samples output by the encoder of the self-encoder according to the labels of the training samples, specifically including:
for each training sample, determining each training sample which is at least partially identical to the training sample label so as to construct a label association relationship among the training samples;
determining each training sample group according to the label association relation, and determining partial labels with the same training samples in each training sample group as target labels;
and taking the decoupling characteristic representing the target label as a target decoupling characteristic, and exchanging the target decoupling characteristic of each training sample in the training sample group.
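The exchange of target decoupling characteristics within a sample group could be sketched as follows; rotating the shared slot across the group is one possible exchange scheme, since the specification does not fix a particular pairing:

```python
def swap_target_features(group, target_idx):
    """group: list of per-sample decoupled feature lists that share a target
    label; target_idx: index of the decoupling characteristic encoding that
    label. Rotates the target slot so each sample receives another sample's
    feature for the same label."""
    swapped = [list(feats) for feats in group]
    rotated = [group[(i + 1) % len(group)][target_idx] for i in range(len(group))]
    for feats, new_feat in zip(swapped, rotated):
        feats[target_idx] = new_feat
    return swapped
```

Because every sample in the group carries the same target label, the swap should ideally leave the decoder's reconstruction unchanged, which is exactly what the exchange loss later measures.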
Optionally, before exchanging the decoupling characteristics of at least part of the training samples output from the encoder of the self-encoder, the method further comprises:
for each training sample, each decoupling characteristic of the training sample is input into a decoder of the self-encoder, and a reconstruction characteristic corresponding to the training sample is determined.
Optionally, determining the loss at least according to the environmental characteristics of the training samples and the exchange reconstruction characteristics specifically includes:
for each training sample, determining a reconstruction loss from differences between environmental features and reconstruction features of the training sample, the reconstruction loss characterizing differences between the input and output of the self-encoder;
determining a switching loss according to the difference between the environmental characteristics and the switching reconstruction characteristics of the training sample, wherein the switching loss represents the difference between the results of decoupling the same label by the encoder of the self-encoder;
the total loss is determined based on the reconstruction loss and the exchange loss of at least a portion of the training samples.
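A hedged sketch of combining the two losses (mean-squared error and the 1:1 weighting are illustrative choices, not mandated by the specification):

```python
def total_loss(env_feats, recon_feats, swap_recon_feats, swap_weight=1.0):
    """Sums a reconstruction loss (input vs. decoder output) and an exchange
    loss (input vs. swap-reconstructed output) over the training samples."""
    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    recon = sum(mse(e, r) for e, r in zip(env_feats, recon_feats))
    swap = sum(mse(e, s) for e, s in zip(env_feats, swap_recon_feats))
    return recon + swap_weight * swap
```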
Optionally, after controlling the movement of the unmanned device according to the decision, the method further comprises:
re-determining motion data of the unmanned device and surrounding obstacles to determine a distance and a relative speed of the unmanned device from the obstacles;
determining rewards corresponding to the decisions according to the determined distance and the determined relative speed, and adjusting parameters of the decision model by taking the maximum rewards as an optimization target.
Optionally, the motion data of the unmanned device and surrounding obstacles are redetermined to determine the distance and the relative speed between the unmanned device and the obstacles, including:
Judging whether the unmanned equipment collides or not;
if yes, determining a penalty corresponding to the decision, stopping the current training process of the decision model, and re-determining environmental characteristics to train the decision model continuously;
if not, the movement data of the unmanned device and surrounding obstacles are re-determined to determine the distance and the relative speed between the unmanned device and the obstacles.
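The collision branch above might be sketched as below (the penalty value and return shape are assumptions for illustration):

```python
def step_feedback(collided, dist, rel_speed, penalty=-100.0):
    """On collision: terminate the current training episode with a fixed
    penalty. Otherwise the freshly determined distance and relative speed
    are passed on to the reward computation."""
    if collided:
        return True, penalty
    return False, (dist, rel_speed)
```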
Optionally, determining the rewards corresponding to the decision according to the determined distance and the relative speed specifically includes:
re-determining a distance to be travelled between the unmanned device and the destination location;
determining collision time of the unmanned equipment and each obstacle according to the re-determined distance and relative speed between the unmanned equipment and each obstacle;
determining rewards corresponding to the decisions according to the collision time, the speed of the unmanned equipment and the distance to be driven;
the distance to be driven is inversely related to the reward, the collision time is positively related to the reward, the speed of the unmanned device is positively related to the reward, and the distance between the unmanned device and each obstacle is positively related to the reward.
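A toy reward with the stated correlations (all weights are invented for illustration; the specification only fixes the signs of the correlations):

```python
def reward_v1(dist_to_go, min_ttc, ego_speed, min_obstacle_dist,
              weights=(0.01, 0.1, 0.05, 0.02)):
    """Negative in the distance still to travel; positive in collision time,
    the unmanned device's speed, and clearance to the nearest obstacle."""
    w_d, w_t, w_v, w_o = weights
    return -w_d * dist_to_go + w_t * min_ttc + w_v * ego_speed + w_o * min_obstacle_dist
```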
Optionally, determining the rewards corresponding to the decision according to the determined distance and the relative speed specifically includes:
determining a steering wheel angle change rate of the unmanned equipment and an acceleration of the unmanned equipment when the unmanned equipment is controlled according to the decision, and redetermining a distance to be travelled between the unmanned equipment and the destination position;
determining rewards corresponding to the decisions according to the redetermined speed of the unmanned equipment, the steering wheel angle change rate, the accelerated speed, the redetermined distance to be driven, the distance between the unmanned equipment and each obstacle and the relative speed between the unmanned equipment and each obstacle;
the steering wheel rotation angle change rate is inversely related to the rewards, the acceleration is inversely related to the rewards, the redetermined distance to be driven is inversely related to the rewards, the redetermined speed of the unmanned equipment is positively related to the rewards, the distance between the unmanned equipment and each obstacle is positively related to the rewards, and the relative speed between the unmanned equipment and each obstacle is inversely related to the rewards.
The present specification provides a control device of an unmanned apparatus, comprising:
the environment characteristic determining module is used for determining the current environment characteristic according to the current motion data of the unmanned equipment, the motion data of all surrounding obstacles and the destination position of the unmanned equipment, wherein the motion data at least comprises position and speed;
the decoupling module is used for inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining the decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the unmanned equipment;
the control module is used for inputting all decoupling characteristics into a decision model obtained by reinforcement learning in advance, determining a decision corresponding to the environmental characteristics, and controlling the unmanned equipment to move according to the decision.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the control method of the unmanned device described above.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the control method of the unmanned device described above when executing the program.
At least one of the technical solutions adopted in this specification can achieve the following beneficial effects:
according to the control method of the unmanned equipment, the environmental characteristics of the environment where the unmanned equipment is located can be subjected to characteristic extraction and decoupling through the pre-trained self-encoder to obtain the decoupling characteristics, the decoupling characteristics are input into the decision model obtained through reinforcement learning in advance, and the decision corresponding to the environmental characteristics is output, so that the unmanned equipment is controlled to move towards a destination according to the obtained decision.
According to the method, according to the environmental characteristics corresponding to the environmental scenes, the decoupling characteristics with the interpretability can be accurately obtained based on the self-encoder, and decisions which are executed by unmanned equipment and are controlled under various environmental scenes can be accurately output through the decision model with the generalization, so that the rationality and the accuracy of the decisions are improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification, illustrate exemplary embodiments of the specification and, together with the description, serve to explain it; they are not intended to limit the specification unduly. In the drawings:
Fig. 1 is a schematic flow chart of a control method of an unmanned device in the present specification;
FIG. 2 is a flow chart of a method of training a self-encoder provided in the present specification;
fig. 3 is a schematic diagram of a label association relationship provided in the present specification;
FIG. 4 is a schematic illustration of a feature exchange provided herein;
FIG. 5 is a schematic illustration of an interval provided herein;
FIG. 6 is a flow chart of a method for training a decision model provided in the present specification;
fig. 7 is a schematic view of a control device of the unmanned apparatus provided in the present specification;
fig. 8 is a schematic structural diagram of an electronic device provided in the present specification.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the present specification more apparent, the technical solutions of the present specification will be clearly and completely described below with reference to specific embodiments of the present specification and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present specification. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present specification with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a control method of an unmanned device in the present specification, specifically including the following steps:
s100: and determining the current environmental characteristics according to the current movement data of the unmanned equipment, the movement data of surrounding obstacles and the destination position of the unmanned equipment, wherein the movement data at least comprises position and speed.
In this specification, the control method of the unmanned apparatus may be executed by the unmanned apparatus, or may be executed by a server. The method performed by the server will be described below as an example. When the method is executed by a server, data of the unmanned device itself, which is a control object of the method, and data in an environment in which the unmanned device is located may be transmitted by the unmanned device to the server.
The decision that is finally determined, which enables the unmanned device to be controlled to arrive safely at the destination, is related to the movement state of the unmanned device, the distribution of obstacles in its surrounding environment, and the movement states of those obstacles.
Thus, in one or more embodiments of the present description, the server may first determine the movement data of each obstacle around the unmanned device, the current movement data of the unmanned device, and the destination position of the unmanned device. The motion data comprise at least position and speed. That is, the movement data of the unmanned device at least include its position and speed, and the movement data of each obstacle at least include the obstacle's position and speed.
The server may then determine environmental characteristics based on current movement data of the unmanned device, movement data of surrounding obstacles, and a destination location of the unmanned device.
Not all obstacles in the environment affect the safety of the unmanned device: the distance between an obstacle and the unmanned device is inversely related to the risk the obstacle may pose, and the higher the speed of the unmanned device and of the obstacle, the higher the risk the unmanned device may face.
Thus, in one or more embodiments of the present disclosure, the server may obtain movement data of each obstacle within a preset range when obtaining movement data of each obstacle.
In one or more embodiments of the present disclosure, the preset range may be set as required, for example, may be set to 100m, and the server needs to determine each obstacle having a distance within 100m from the unmanned device and the movement data of each obstacle. The preset range may be a preset range in front of the unmanned aerial vehicle, or may be a preset range in rear of the unmanned aerial vehicle, or may be an area expanded around the unmanned aerial vehicle, and the present specification is not limited thereto.
In one or more embodiments of the present disclosure, the road along which the unmanned device moves may be divided into multiple lanes, and the unmanned device may switch between lanes while moving. Therefore, when determining the obstacles within the preset range of the unmanned device, the obstacles ahead of and behind the unmanned device in its own lane may be determined according to its position, as may the obstacles ahead of and behind it in the adjacent lanes on its left and right, including any obstacle at the parallel position of the unmanned device in the adjacent lanes. For convenience of description, the lane in which the unmanned device is located is referred to below as the "target lane", and the lanes adjacent on both sides are referred to as the "adjacent lanes".
The parallel position refers to a position after the unmanned equipment is translated to an adjacent lane along a direction perpendicular to the lane. The adjacent lanes are used to refer to lanes on both sides of the target lane, and the present specification does not limit the number of adjacent lanes. For example, the adjacent lanes may be one lane on the left side and one lane on the right side of the target lane, or two lanes on the left side, two lanes on the right side, or the like, and may be specifically set as required.
In one or more embodiments of the present disclosure, in particular, the server may determine an obstacle in front of a lane where the unmanned device is located that is closest to the unmanned device and an obstacle in back of the lane where the unmanned device is closest to the unmanned device, respectively. And determining, for each adjacent lane, an obstacle nearest to the unmanned device within a preset range in front of the unmanned device, an obstacle nearest to the unmanned device within a preset range behind the unmanned device, and an obstacle in a parallel position corresponding to the adjacent lane in the adjacent lane.
In the following steps, the description takes as an example the case where the surrounding obstacles are the obstacles closest to the unmanned device, within the preset range, in the target lane and the adjacent lanes. That is, unless otherwise specified, the obstacles mentioned below refer to the obstacles closest to the unmanned device within the preset range in the target lane and the adjacent lanes.
In one or more embodiments of the present disclosure, when determining the current environmental characteristic according to the movement data of each obstacle, the current movement data of the unmanned device, and the destination location, specifically, first, the server may determine the distance and the relative speed between each obstacle and the unmanned device according to the movement data of each obstacle and the movement data of the unmanned device, respectively. That is, for each obstacle, the relative speed of the unmanned device and the obstacle is determined according to the speed of the unmanned device and the speed of the obstacle, and the distance between the unmanned device and the obstacle is determined according to the position of the unmanned device and the position of the obstacle.
After determining the distance and the relative speed of each obstacle from the unmanned device, the server may determine the collision time (Time To Collision, TTC) of each obstacle with the unmanned device according to the determined distance and relative speed of each obstacle from the unmanned device, respectively.
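TTC here is the usual gap-over-closing-speed quantity; a minimal sketch (the cap for non-closing gaps is an assumed convention, not stated in the specification):

```python
def time_to_collision(distance, relative_speed, cap=100.0):
    """relative_speed > 0 means the gap between the unmanned device and the
    obstacle is closing; a non-closing gap yields the cap (no collision foreseen)."""
    if relative_speed <= 0:
        return cap
    return min(distance / relative_speed, cap)
```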
Since one purpose of the decision determined by the server is to control the unmanned device to reach the destination safely, the server may also determine the distance between the unmanned device and the destination, according to the position of the unmanned device and the destination position, as the distance to be travelled.
Finally, the server may determine environmental characteristics at a current time based on a distance between each obstacle and the unmanned device, a relative speed between each obstacle and the unmanned device, a collision time between each obstacle and the unmanned device, a distance between the unmanned device and the destination, and movement data of the unmanned device.
Of course, the environmental characteristics of the current environment of the unmanned device may also be determined according to other data, for example, the environmental characteristics of the current moment may also be determined according to the acceleration, the direction of each obstacle, the acceleration, etc. of the unmanned device, and may be specifically set according to needs, which is not limited herein.
In one or more embodiments of the present disclosure, the determined distance between each obstacle and the unmanned device, the relative speed between each obstacle and the unmanned device, the collision time between each obstacle and the unmanned device, the distance between the unmanned device and the destination, and the movement data of the unmanned device may be directly used as the environmental characteristics of the current time.
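Taking these quantities directly as the environmental characteristic amounts to a flat concatenation; a sketch (the field names are assumptions for illustration):

```python
def environment_feature(obstacles, ego, dist_to_go):
    """obstacles: list of {"dist", "rel_speed", "ttc"} dicts, one per tracked
    obstacle; ego: {"x", "y", "speed"} movement data of the unmanned device.
    Returns one flat feature vector for the current time."""
    feat = []
    for ob in obstacles:
        feat += [ob["dist"], ob["rel_speed"], ob["ttc"]]
    feat += [ego["x"], ego["y"], ego["speed"], dist_to_go]
    return feat
```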
S102: and inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the unmanned equipment.
In one or more embodiments of the present disclosure, after determining an environmental feature, the server may input the environmental feature into an encoder in a pre-trained self-encoder, decouple the environmental feature, and determine each decoupling feature corresponding to the environmental feature. Wherein each decoupling feature is used to characterize the obstacle location profile of each lane and the speed profile of each lane obstacle and the unmanned device.
S104: and inputting each decoupling characteristic into a decision model obtained by reinforcement learning in advance, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
In one or more embodiments of the present disclosure, after determining each decoupling feature of the environmental feature, the server may input each decoupling feature into a decision model obtained by reinforcement learning in advance, determine a decision corresponding to the environmental feature, and control the movement of the unmanned device according to the decision.
In one or more embodiments of the present description, the environmental features correspond to the "state" concept in the reinforcement learning field, and the resulting decisions correspond to the actions performed in that state.
In one or more embodiments of the present description, the finalized decision may be that the unmanned device should follow a lane of movement under the environmental characteristics of the environment in which the unmanned device is currently located. For example, there are equidirectional lanes a, b and c, the lane where the unmanned device is currently located is lane b, that is, the target lane is lane b, and if the output decision corresponds to lane a, the server may control the unmanned device to switch to move along lane a.
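The lane-choice decision in the example above maps straightforwardly onto a control action; a sketch (the action strings are invented for illustration):

```python
def apply_lane_decision(current_lane, decision, lanes=("a", "b", "c")):
    """decision names the lane the unmanned device should follow."""
    if decision not in lanes:
        raise ValueError(f"unknown lane: {decision}")
    if decision == current_lane:
        return "keep_lane"
    return f"change_to_{decision}"
```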
Based on the control method of the unmanned equipment shown in fig. 1, through a pre-trained self-encoder, the environmental characteristics of the environment where the unmanned equipment is located can be subjected to characteristic extraction and decoupling, so as to obtain various decoupling characteristics, the various decoupling characteristics are input into a decision model obtained through pre-reinforcement learning, and decisions corresponding to the environmental characteristics are output, so that the unmanned equipment is controlled to move towards a destination according to the obtained decisions.
According to this method, based on the environmental features corresponding to each environmental scene, interpretable decoupling features can be accurately obtained through the self-encoder, and the decisions to be executed by the unmanned device in various environmental scenes can be accurately output through the generalizable decision model, thereby improving the rationality and accuracy of the decisions.
In one or more embodiments of the present disclosure, the self-encoder is trained using the method shown in FIG. 2.
Fig. 2 is a flow chart of a method for training a self-encoder in the present specification, specifically including the following steps:
S200: Divide the environmental data collected during the running of the collection device into a plurality of environmental segments at a preset time interval.
As can be seen from the execution of the control method of the unmanned device described above, the method obtains decisions based on the self-encoder and the decision model. The decision model is used to output decisions, and its input is the output of the encoder in the self-encoder.
In this specification, the function of the self-encoder is to output data that the decision model, trained by reinforcement learning, can conveniently learn from; that is, to convert environmental features determined from complex environmental data into a form convenient for the decision model to analyze and learn.
In one or more embodiments of the present description, the method of training a self-encoder may be performed by a server.
The training samples used to train the self-encoder may be determined from environmental data pre-collected by a collection device. The collection device may be an unmanned device, a manned vehicle capable of collecting environmental data, or another device, which is not limited herein. When the collection device is an unmanned device, the unmanned device serving as the collection device and the unmanned device controlled by the control method may be the same device or different devices; this specification is not limited thereto. The following description takes the case where the collection device is the same unmanned device as the one controlled by the control method.
The environmental data, that is, the data in the environment acquired by the acquisition device during the running process, specifically may at least include: the current motion data of the acquisition equipment, the motion data of surrounding obstacles and the destination position of the acquisition equipment, which are acquired by the acquisition equipment at all times in the driving process.
For the unmanned device, at each moment of its movement toward the destination, at least some of the number, positions, and speeds of obstacles in the environment, and their distances from the unmanned device, will vary, and at least some of the unmanned device's own speed, position, distance from the destination, and so on will also vary. The unmanned device and the obstacles thus form different environmental scenes at different moments. The durations of these scenes differ, and each moment corresponds to one environmental segment within a scene.
In addition, when the unmanned device moves in the environment, the data that can be directly collected and that influence its decisions are the data of the unmanned device itself and of each obstacle in the current environment. Thus, in one or more embodiments of the present disclosure, when training the self-encoder, the server may divide the environmental data collected during the running of the collection device into a plurality of environmental segments at a preset time interval.
In one or more embodiments of the present description, the time interval for dividing the environmental segments may be the same as the time interval for dividing the respective time instants. For example, the time interval may be 1s, 40s, 60s, etc.
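The segment division described above can be sketched in Python as follows; the function name, the per-moment `frames` list, and the sampling period are assumptions for illustration:

```python
def split_into_segments(frames, interval_s, frame_period_s=1.0):
    """Group per-moment environment frames into fixed-length environmental segments.

    `frames` is a time-ordered list of per-moment environment data; the segment
    length is the preset time interval divided by the sampling period.
    """
    per_segment = max(1, int(interval_s / frame_period_s))
    return [frames[i:i + per_segment] for i in range(0, len(frames), per_segment)]
```

For instance, with a 4 s interval and 1 s sampling, 10 frames split into segments of 4, 4, and 2 frames.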
S202: for each environmental segment, determining the environmental characteristics of the environmental segment according to the environmental data corresponding to the environmental segment, and taking the environmental characteristics as a training sample.
In one or more embodiments of the present disclosure, after determining each environmental segment, the server may determine, for each environmental segment, an environmental characteristic of the environmental segment according to environmental data corresponding to the environmental segment, and use the environmental characteristic of the environmental segment as a training sample.
S204: According to the environmental features of each environmental segment, determine the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the collection device, and use them as the labels of the training samples.
In one or more embodiments of the present disclosure, after determining the environmental features of an environmental segment, the server may determine the attributes corresponding to those environmental features. These attributes include at least the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the collection device.
In one or more embodiments of the present disclosure, after determining each environmental segment, the server may determine, according to the environmental features of the segment, the obstacle position distribution of each lane and the speed distribution of the obstacles in each lane relative to the collection device, and use them as the label of the corresponding training sample. That is, the attributes of each training sample serve as its label.
In one or more embodiments of the present disclosure, when determining the obstacle position distribution of each lane according to the environmental features of an environmental segment, the server may determine, according to those environmental features, the parallel positions on both sides of the collection device in the lanes adjacent to its own lane, and determine the obstacle distribution at those parallel positions as the parallel position distribution. The server may also determine, according to the environmental features, the distance between the front obstacle in each lane and the collection device, and determine the lane interval distribution according to the distance corresponding to each lane. The server can then determine the obstacle position distribution of each lane according to the parallel position distribution and the lane interval distribution.
The lane interval distribution represents the distance (gap) between the front obstacle in each lane and the collection device. The parallel position distribution indicates whether obstacles exist at the parallel positions in the adjacent lanes on both sides of the collection device. For example, a parallel position with an obstacle may be written as 1 and one without as 0; when there is no obstacle at the parallel position on the left of the collection device but there is one on the right, the parallel position distribution can be expressed as binary 01, or decimal 1.
In one or more embodiments of the present disclosure, an interval distribution feature may also be determined from the lane interval distribution; it represents the lane in which the front obstacle is farthest from the collection device. The obstacle position distribution of each lane can then be determined according to the interval distribution feature and the parallel position distribution. For example, Gap0, Gap1, and Gap2 respectively represent the distance between the front obstacle in the left adjacent lane and the collection device, that in the target lane, and that in the right adjacent lane. If Gap0 is the largest, the interval distribution feature may be written as 0; if Gap1 is the largest, as 1; and if Gap2 is the largest, as 2.
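A minimal Python sketch of this interval distribution feature, assuming the per-lane Gap values are given; the tie-break in favor of the collection device's own lane follows the rule stated later in this specification, and the function name is illustrative:

```python
def interval_feature(gap0, gap1, gap2):
    """Return 0/1/2 for the lane whose front obstacle is farthest from the device
    (0 = left adjacent lane, 1 = target lane, 2 = right adjacent lane).
    On a tie involving the target lane, the target lane is preferred."""
    if gap1 >= max(gap0, gap2):
        return 1  # the collection device's own lane wins ties
    return [gap0, gap1, gap2].index(max(gap0, gap1, gap2))
```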
In one or more embodiments of the present disclosure, when determining the speed distribution of the obstacles in each lane relative to the collection device, the server may, for each lane, determine a front target obstacle from the obstacles in front of the collection device in that lane, and determine the front speed feature of the lane according to the speed comparison relation between the front target obstacle and the collection device. Likewise, the server may determine a rear target obstacle from the obstacles behind the collection device in the lane, and determine the rear speed feature of the lane according to the speed comparison relation between the rear target obstacle and the collection device. The speed distribution of the obstacles in each lane relative to the collection device is then determined according to the front and rear speed features of each lane.
The front speed feature represents the speed relation between the obstacles in front of the collection device in each lane and the collection device, and the rear speed feature represents the speed relation between the obstacles behind the collection device in each lane and the collection device. In the front speed feature, since the obstacle is in front of the collection device, the larger the obstacle's speed, the safer the collection device; an obstacle faster than the collection device in the corresponding lane may be recorded as 1, and one slower as 0. When there is no front obstacle in any of the left adjacent lane, the target lane, or the right adjacent lane, that lane is safe for the unmanned device and may therefore be recorded as 1. The front speed feature represented by binary 111 indicates that the front obstacle speeds in the left adjacent lane, the target lane, and the right adjacent lane are all greater than that of the collection device; the binary number 111 may also be represented as the decimal number 7.
In the rear speed feature, since the obstacle is behind the collection device, the smaller the obstacle's speed, the safer the collection device; an obstacle slower than the collection device in the corresponding lane is recorded as 1, and one faster as 0. When there is no rear obstacle in any of the left adjacent lane, the target lane, or the right adjacent lane, that lane is safe for the unmanned device and may therefore be recorded as 1.
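The front and rear speed features described above can be sketched as one hedged Python helper; the encoding (`None` for a missing obstacle, bit order left/target/right) follows the text, while the function name and signature are illustrative:

```python
def speed_feature(obstacle_speeds, ego_speed, front=True):
    """Encode one bit per lane (order: left adjacent, target, right adjacent).
    Front: a faster obstacle (or none) is safe -> 1. Rear: a slower obstacle
    (or none) is safe -> 1. Returns the decimal value of the 3-bit code."""
    bits = 0
    for v in obstacle_speeds:
        if v is None:
            safe = True  # no obstacle in that lane is safe for the device
        else:
            safe = v > ego_speed if front else v < ego_speed
        bits = (bits << 1) | int(safe)
    return bits
```

For example, three front obstacles all faster than the device encode as binary 111, i.e. decimal 7.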
S206: According to the labels of the training samples, exchange the decoupling features of at least some training samples output by the encoder of the self-encoder, and input the exchanged features into the decoder to obtain exchange reconstruction features.
In one or more embodiments of the present disclosure, after determining the label of each training sample, the server may exchange the decoupling features of at least some training samples output by the encoder of the self-encoder according to those labels, and input the exchanged features into the decoder to obtain the exchange reconstruction features.
In one or more embodiments of the present disclosure, when exchanging the decoupling features of at least some training samples output by the encoder of the self-encoder according to the labels of the training samples, the server may first determine, for each training sample, the training samples whose labels are at least partially identical to its own, so as to construct label association relationships between the training samples. According to these association relationships, the server may then determine the training sample groups and, for each group, take the partial label shared by all training samples in the group as the target label.
Taking the decoupling feature that characterizes the target label as the target decoupling feature, the server may then exchange the target decoupling features among the training samples in the group, and determine the exchange feature of each training sample according to its decoupling features after the exchange. After obtaining the exchange feature of each training sample, the server inputs each exchange feature into the decoder to determine the exchange reconstruction feature corresponding to each training sample.
Fig. 3 is a schematic diagram of a label association relationship provided in the present specification. As shown in the figure, each diagonally filled rectangle represents a training sample, and training samples connected by a double-headed arrow have an association relationship. The upper rectangle in fig. 3 is connected with the lower-left rectangle by attribute A and attribute B, that is, attributes A and B of the two rectangles are the same. The upper rectangle is connected with the lower-right rectangle by attribute D, that is, the two have the same attribute D. The lower-left rectangle is connected with the lower-right rectangle by attribute B and attribute C, that is, attributes B and C of the two are the same.
In one or more embodiments of the present disclosure, when determining the training sample groups, the server may, for each training sample, determine according to the label association relationships the other training samples whose labels are at least partially identical to those of the training sample as its associated training samples, take the shared partial labels as associated labels, and take the attributes corresponding to the associated labels as associated attributes. The server may then determine the training sample groups according to the training samples, the determined associated training samples, and a preset grouping value. The grouping value represents the number of training samples in each group and may be set as needed; for example, a grouping value of 2 means a group contains 2 training samples, and the grouping value may also be 3 or another number, which is not limited herein.
Taking a grouping value of 2 as an example, assume the associated attribute is the lane interval distribution. The server can exchange the decoupling features used to represent the lane interval distribution between the two training samples in the group and, after the exchange, splice the decoupling features of each training sample to obtain its exchange feature.
Taking a grouping value of 3 as an example, assume that the associated attribute is the parallel position distribution and that training sample group X includes training samples X1, X2, and X3. The server may move the decoupling feature corresponding to the parallel position distribution in training sample X1 to training sample X2, the one in X2 to X3, and the one in X3 to X1, thereby realizing the exchange of decoupling features. After the exchange, the server may splice the decoupling features of each training sample in group X to obtain its exchange feature.
In one or more embodiments of the present disclosure, since different decoupling features correspond to different attributes, that is, are used to characterize different attributes, when the decoupling features exchanged by the training samples in each training sample group are spliced, the splicing may be performed according to a preset sequence.
Fig. 4 is a schematic diagram of a feature exchange provided in the present specification. As shown in the figure, the grouping value is 2. Rectangles A1, A2, A3, and A4 are the decoupling features of training sample X1 in training sample group X, and rectangles B1, B2, B3, and B4 are those of training sample X2, where the diagonally filled rectangles represent the decoupling features corresponding to the lane interval distribution, the grid-filled rectangles those of the front speed feature, the horizontally filled rectangles those of the rear speed feature, and the vertically filled rectangles those of the parallel position distribution. Because the decoupling feature A1 of training sample X1, used to represent the lane interval distribution, corresponds to the same label as the decoupling feature B1 of training sample X2, A1 and B1 are exchanged. After the exchange, the server splices the decoupling features of each training sample to obtain its exchange feature, where C1 represents the exchange feature of training sample X1 and C2 that of training sample X2. The splicing order of the decoupling features after the exchange is the same for all training samples: lane interval distribution, front speed feature, rear speed feature, parallel position distribution.
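The exchange-and-splice step illustrated in fig. 4 might be sketched as follows in Python; representing decoupling features as a dict keyed by attribute, and the attribute names themselves, are assumptions for illustration:

```python
# Fixed splice order, following the text: lane interval distribution,
# front speed feature, rear speed feature, parallel position distribution.
ORDER = ["gap", "front_speed", "rear_speed", "parallel"]

def exchange_features(feats_a, feats_b, target_attr):
    """Swap the decoupling feature of the shared (target) attribute between two
    samples of a group, then splice each sample's features in the fixed order."""
    a, b = dict(feats_a), dict(feats_b)
    a[target_attr], b[target_attr] = b[target_attr], a[target_attr]
    splice = lambda d: [v for attr in ORDER for v in d[attr]]
    return splice(a), splice(b)
```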
In one or more embodiments of the present disclosure, the server may exchange the same decoupling characteristics together when there are multiple same decoupling characteristics between training samples in the same training sample set.
In addition, in one or more embodiments of the present disclosure, before exchanging the decoupling features of at least some training samples output by the encoder of the self-encoder, the server may also, for each training sample, input its decoupling features into the decoder of the self-encoder and determine the reconstruction feature corresponding to the training sample, so that the loss can be determined from the reconstruction features in a subsequent step.
S208: Determine a loss at least according to the environmental features and exchange reconstruction features of the training samples, so as to adjust the parameters of the self-encoder.
In one or more embodiments of the present disclosure, the server may determine the loss based at least on the environmental characteristics of the training samples and the exchange reconstruction characteristics, and adjust the parameters of the self-encoder with the goal of minimizing the loss.
In one or more embodiments herein, in determining the loss, the server may determine, for each training sample, the loss based on differences between environmental characteristics and exchange reconstruction characteristics of the training sample.
In one or more embodiments of the present disclosure, in determining the loss, the server may further determine, for each training sample, a reconstruction loss based on differences between environmental features and reconstruction features of the training sample, and a swap loss based on differences between environmental features and swap reconstruction features of the training sample. Thereafter, a total loss is determined based on the reconstruction loss and the exchange loss of at least a portion of the training samples.
The reconstruction loss characterizes the difference between the input and output of the self-encoder, that is, the difference between the environmental feature input into the self-encoder and the reconstruction feature of that environmental feature output by the self-encoder.
In the present specification, one goal of training the self-encoder according to the reconstruction loss is to correctly restore the environmental feature after the decoupling features, obtained by the encoder decoupling that environmental feature, are input into the decoder. That is, the goal is to make the reconstruction feature output by the decoder identical to the environmental feature input into the encoder. Accurate decoupling ensures accurate reconstruction: the smaller the difference between the reconstruction feature and the environmental feature, the more accurately the decoupling features obtained by the encoder represent the environmental feature.
The exchange loss characterizes the difference between the encoder's decoupling results for the same label. Among the decoupling features of a training sample obtained from the encoder, those sharing a label with other training samples are exchanged with those samples; after reconstruction by the decoder, a difference exists between the environmental feature of the training sample and its exchange reconstruction feature. By minimizing the difference between the environmental feature input into the self-encoder and the exchange reconstruction feature it outputs, the difference between the self-encoder's decoupling results for the same label can be minimized, which is another goal of training the self-encoder in the present specification.
Training the self-encoder with the exchange loss makes the decoupling features obtained by the encoder decoupling the environmental features mutually independent, and gives each decoupling feature of a training sample a stable and accurate correspondence with the labels of that training sample.
In one or more embodiments of the present disclosure, after determining the total loss, the server may adjust the parameters of the self-encoder with the goal of minimizing the total loss.
In one or more embodiments of the present disclosure, the number of training samples involved in determining the total loss may be set as desired when determining the total loss, and the present disclosure is not limited thereto. For example, the total loss may be determined once based on the reconstruction loss and the exchange loss for each training sample in a training sample set, or determined once based on the reconstruction loss and the exchange loss for each training sample in a plurality of training samples.
In one or more embodiments of the present disclosure, taking an example of determining a total loss according to a reconstruction loss and an exchange loss of each training sample in one training sample group, when determining the total loss according to the reconstruction loss and the exchange loss of each training sample, the server may determine, for each training sample group, a reconstruction loss and an exchange loss corresponding to each training sample in the training sample group, and determine the total loss according to each reconstruction loss and each exchange loss corresponding to the training sample group.
In one or more embodiments provided herein, the formula for determining the reconstruction loss may be specifically as follows:
Loss1 = ‖S_i − S′_i‖²

where Loss1 represents the reconstruction loss of the i-th training sample, S_i represents the i-th training sample (i.e., the i-th environmental feature), and S′_i represents the reconstruction feature of the i-th training sample.
In one or more embodiments provided herein, the formula for determining the exchange loss may be specifically as follows:

Loss2 = (1/n) Σᵢ₌₁ⁿ ‖S_i − S″_i‖²

where Loss2 represents the exchange loss, S_i represents the i-th training sample of a training sample group (i.e., the i-th environmental feature), S″_i represents the exchange reconstruction feature of the i-th training sample in the group, and n represents the number of training samples in the group.
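As one concrete reading of the two losses, here is a minimal pure-Python sketch; the 1/n averaging in the exchange loss is an assumption inferred from n appearing in the formula's explanation, and the function names are illustrative:

```python
def reconstruction_loss(s, s_rec):
    """Loss1: squared L2 distance between an environmental feature and its
    reconstruction, both given as flat lists of numbers."""
    return sum((a - b) ** 2 for a, b in zip(s, s_rec))

def exchange_loss(samples, swap_recons):
    """Loss2: mean squared L2 distance between each sample of a training sample
    group and its exchange reconstruction (1/n averaging assumed)."""
    n = len(samples)
    return sum(reconstruction_loss(s, r) for s, r in zip(samples, swap_recons)) / n
```

The total loss could then be formed from these two terms over the samples of a group, with the self-encoder's parameters adjusted to minimize it.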
In addition, in step S204 of the present specification, when determining the interval distribution feature from the lane interval distribution, if there are two lanes in which the distance between the front obstacle and the collection device is the same and largest, the server may determine whether the two lanes include the lane in which the collection device is located; if so, the server may treat the gap of the collection device's own lane as the largest and use that lane when determining the interval distribution feature.
For example, assume the lanes include a, b, and c, and in environmental scene A the collection device is in lane b. In front of it there is an obstacle O1 in lane a, an obstacle O2 in lane b, and an obstacle O3 in lane c, where O1 is 10 m from the collection device, O2 is 20 m away, and O3 is also 20 m away. The server may determine that the gap of lane b, the lane the collection device is in, is the largest. When representing the interval distribution feature of environmental scene A, the identifier of lane b may be used: for example, the lane the collection device is in is denoted by 1, the lane on its left by 0, and the lane on its right by 2. The interval distribution feature of environmental scene A may then be represented by 1.
In one or more embodiments provided herein, when determining each attribute, the server may specifically determine the interval distribution of the environmental scene according to the position of the collection device and the positions of the obstacles included in the environmental data of the scene, and determine the speed distribution of the scene according to the speed of the collection device and the speeds of those obstacles. The server may also determine the parallel positions in the lanes adjacent to the collection device, judge whether an obstacle exists at each parallel position, and determine the parallel position distribution according to the judgment result.
In one or more embodiments of the present specification, a determination result of whether an obstacle exists at the parallel position may be represented by 1 and 0, for example, the existence of an obstacle at the parallel position may be represented by 1, and the absence of an obstacle at the parallel position may be represented by 0. (1, 0) indicates that there is an obstacle in the left side parallel position of the unmanned device and there is no obstacle in the right side parallel position, and (1, 1) indicates that there is an obstacle in both the left side parallel position and the right side parallel position of the unmanned device.
In one or more embodiments of the present specification, the parallel position distribution of the environmental scene may be represented by a decimal number; for example, the judgment result (1, 0) may be represented by the decimal number 2, and (1, 1) by the decimal number 3. The server may use the decimal value of the judgment result as the parallel position distribution.
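The binary-to-decimal encoding of the parallel-position judgment result can be sketched as a one-line Python helper; the function name is illustrative:

```python
def parallel_distribution(left_occupied, right_occupied):
    """Encode the (left, right) parallel-position occupancy judgment as a decimal
    value: (1, 0) -> 2 and (1, 1) -> 3, matching the binary reading in the text."""
    return (int(left_occupied) << 1) | int(right_occupied)
```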
In one or more embodiments of the present disclosure, when there is no adjacent lane on one or both sides of the unmanned device (taking a missing right adjacent lane as an example), the absence of the lane means the device cannot travel on that side, which is similar to an obstacle being present there. The front speed feature, rear speed feature, and lane interval distribution for a non-existent adjacent lane can therefore be determined according to default values, which may be set as needed and are not limited herein. When an adjacent lane is absent on one or both sides of the collection device, the obstacle in the non-existent lane may be expressed by the default value; for example, if 1 indicates that an obstacle is present, 1 may represent the obstacle distribution of the non-existent adjacent lane.
In one or more embodiments of the present disclosure, the server may determine a position line that passes through the foremost end of the collection device's body and is perpendicular to the lane direction, and use the distance between the obstacle in each lane and this position line as the obstacle's distance (gap) from the collection device.
Fig. 5 is a schematic diagram of a gap provided in the present specification. As shown in the figure, the gray filled rectangle represents the unmanned device serving as the collection device, and the white filled rectangles represent obstacles. The horizontal broken line indicates the position line. In the left adjacent lane the distance between the obstacle and the position line is G1, and in the right adjacent lane it is G2, with G1 greater than G2; there is no obstacle in the lane the unmanned device is in. When determining the interval distribution feature, the lane with the largest gap between the unmanned device and a front obstacle is selected, because the larger the gap, the larger the unmanned device's movement space and the safer it is. Therefore, although G1 is greater than G2, since there is no obstacle in the unmanned device's own lane, that lane may be determined as the lane with the largest gap.
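The rule illustrated in fig. 5, where an obstacle-free own lane wins regardless of G1 and G2, can be sketched as follows; treating a missing front obstacle as an infinite gap is an assumed implementation choice, and the function name is illustrative:

```python
import math

def widest_gap_lane(g_left, g_own, g_right):
    """Return 0/1/2 for the left-adjacent / own / right-adjacent lane with the
    largest gap; `None` (no front obstacle) is treated as an unbounded gap."""
    gaps = [math.inf if g is None else g for g in (g_left, g_own, g_right)]
    return max(range(3), key=lambda i: gaps[i])
```

With G1 = 30 m on the left, no obstacle ahead in the own lane, and G2 = 5 m on the right, the own lane (index 1) is selected.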
In one or more embodiments of the present description, the server may also train the decision model through deep reinforcement learning.
Fig. 6 is a flow chart of a method for training a decision model provided in the present specification. The method flow for training the decision model can comprise the following steps:
S300: and determining the current environmental characteristics according to the current movement data of the unmanned equipment, the movement data of surrounding obstacles and the destination position of the unmanned equipment, wherein the movement data at least comprises position and speed.
In one or more embodiments of the present disclosure, the decision model may be trained by deep reinforcement learning, using the trained encoder, by simulating the movement of the unmanned device in the environment. Through simulation, the server can obtain the motion trajectory of the unmanned device during movement, the states (i.e., environmental features) at the moments along the trajectory, and the actions (i.e., executed decisions) corresponding to those states, and calculate the reward corresponding to each action. After determining the state of the unmanned device, the server can input the state into the encoder to obtain the interpretable decoupling features, and then input the decoupling features into the decision model to obtain the decision, i.e., the action, corresponding to the state.
In one or more embodiments of the present description, first, the server may determine a current environmental characteristic based on current movement data of the unmanned device, movement data of surrounding obstacles, and a destination location of the unmanned device, the movement data including at least a location and a speed.
S302: and inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the unmanned equipment.
In one or more embodiments of the present disclosure, after determining the environmental feature, the server may input the environmental feature into an encoder in a pre-trained self-encoder, decouple the environmental feature, determine each decoupling feature corresponding to the environmental feature, where the decoupling feature is used to characterize an obstacle location distribution of each lane and a speed distribution of each lane obstacle and the unmanned device.
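A toy illustration of such decoupling — a linear encoder whose latent vector is split into named slices, one per interpretable factor. The layer sizes, slice layout, and class names are assumptions for illustration only, not the patent's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

class Encoder:
    """Minimal sketch: encode the environmental feature, then read the latent
    vector as named slices, one per interpretable factor (assumed layout)."""
    SLICES = {"obstacle_position": slice(0, 4),   # per-lane position distribution
              "speed_distribution": slice(4, 8)}  # per-lane speed features

    def __init__(self, in_dim=16, latent_dim=8):
        # A single linear layer stands in for the trained encoder network.
        self.w = rng.normal(scale=0.1, size=(in_dim, latent_dim))

    def decouple(self, env_feature):
        z = env_feature @ self.w                       # encode
        return {name: z[s] for name, s in self.SLICES.items()}

enc = Encoder()
features = enc.decouple(rng.normal(size=16))
```

Each named slice is what the text calls a decoupling feature; downstream, the decision model consumes these slices instead of the raw state.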
The training process of the self-encoder may refer to the foregoing steps, and the description is omitted herein.
S304: and inputting each decoupling characteristic into a decision model to be trained, determining a decision corresponding to the environmental characteristic, and controlling the unmanned equipment to move according to the decision.
In one or more embodiments of the present disclosure, after obtaining each decoupling feature, the server may input each decoupling feature into a decision model to be trained, and control the unmanned device according to a decision output by the decision model.
S306: and re-determining the movement data of the unmanned device and surrounding obstacles to determine the distance between the unmanned device and the obstacles and the relative speed.
In one or more embodiments of the present disclosure, after controlling the unmanned device according to the decision output by the decision model, the server may re-determine the movement data of the unmanned device and surrounding obstacles to determine the distance and relative speed of the unmanned device from the obstacles.
The purpose of controlling the unmanned device according to the decision is that the unmanned device can travel safely to the destination; the danger that may occur during travel is the danger of collision.
Therefore, in one or more embodiments of the present disclosure, after controlling the unmanned device according to the decision output by the decision model, the server may further determine whether the unmanned device collides, if so, determine a penalty corresponding to the decision, stop training the decision model, and re-determine environmental characteristics to train the decision model. If not, the movement data of the unmanned device and surrounding obstacles are re-determined to determine the distance and the relative speed between the unmanned device and the obstacles.
S308: determining rewards corresponding to the decisions according to the determined distance and the determined relative speed, and adjusting parameters of the decision model by taking the maximum rewards as an optimization target.
In one or more embodiments of the present disclosure, after determining the distance and the relative speed between the unmanned device and each obstacle, the server may determine, at least according to the determined distance and the determined relative speed, a reward corresponding to the decision, and adjust parameters of the decision model with the reward being the maximum target of optimization.
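The S300–S308 loop can be sketched as follows, with a hypothetical one-dimensional toy environment standing in for the simulator and the encoding step (S302) folded into the `decide` callable for brevity; none of these names come from the patent:

```python
class ToyEnv:
    """Hypothetical 1-D stand-in: the device moves toward a destination at
    s=10; the reward is the negative remaining distance, so the distance to
    be travelled is inversely related to the reward."""
    def reset(self):
        self.s = 0.0
        return self.s

    def step(self, action):
        self.s += action
        done = self.s >= 10.0
        return self.s, False, done          # state, collided, done

    def reward(self, state):
        return -(10.0 - state)

def train_decision_model(env, decide, episodes=3):
    """Sketch of the loop: observe the state (S300), decide (S304), move and
    re-measure (S306), then reward the decision (S308)."""
    rewards = []
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            action = decide(state)
            state, collided, done = env.step(action)
            rewards.append(-100.0 if collided else env.reward(state))
            if collided:
                break                        # penalty, restart the episode
    return rewards

rewards = train_decision_model(ToyEnv(), decide=lambda s: 1.0)
```

In a real setup `decide` would be the encoder plus decision model, and the recorded rewards would drive the parameter update of S308.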
In one or more embodiments of the present disclosure, when determining the reward corresponding to the decision, the server may redetermine a distance to be travelled between the unmanned device and the destination location, and determine a collision time of the unmanned device with each obstacle according to the redetermined distance and relative speed of the unmanned device with each obstacle. Determining rewards corresponding to the decision according to the collision time, the speed of the unmanned equipment and the distance to be driven;
wherein the distance to be travelled is inversely related to the reward, as the purpose of the unmanned device is to travel safely towards the destination. Since the unmanned device is more dangerous as the collision time is smaller, the collision time is positively correlated with the reward. And, the speed of the unmanned device is positively correlated with the reward, and the distance of the unmanned device from each obstacle is positively correlated with the reward.
In one or more embodiments of the present disclosure, in order for the unmanned device to smoothly perform a decision, safely arrive at a destination without colliding with an obstacle, the server may further determine a steering wheel angle change rate of the unmanned device and an acceleration of the unmanned device when controlling the unmanned device according to the decision when determining a reward corresponding to the decision, and re-determine a distance to be travelled between the unmanned device and the destination location.
And then, the server can determine rewards corresponding to the decision according to the redetermined speed of the unmanned equipment, the steering wheel angle change rate, the acceleration of the unmanned equipment, the redetermined distance to be driven, the distance between the unmanned equipment and each obstacle and the relative speed between the unmanned equipment and each obstacle.
Wherein the steering wheel angle change rate is inversely related to the reward, the acceleration is inversely related to the reward, the redetermined distance to be travelled is inversely related to the reward, the redetermined speed of the unmanned device is positively related to the reward, and the distance between the unmanned device and each obstacle is positively related to the reward. Since the distance between the obstacle and the unmanned device is more stable when the relative speed between the obstacle and the unmanned device is smaller, the possibility of collision is smaller, and thus the relative speed between the unmanned device and each obstacle is inversely related to the reward.
Of course, the method for calculating the reward of a decision provided in the present specification is merely an example. Specifically, the reward may be determined according to one or more of the collision time between the unmanned device and each obstacle, the distance to be travelled, the distance between the unmanned device and each obstacle, the relative speed between the unmanned device and each obstacle, the speed of the unmanned device, and the like, or calculated by other methods, which is not limited in this specification.
In the present specification, the state is converted into interpretable decoupling features by the encoder in the self-encoder, which helps train the decision model and improves both training efficiency and the performance of the trained model.
In one or more embodiments of the present disclosure, the reward function corresponding to the decision model may be specifically as follows:
r = r1 + r2 + r3
r2 = -l + v
wherein r represents the reward function, r1 represents the safety-related reward, r2 represents the reward related to the efficiency of moving to the destination, and r3 represents the reward related to smoothness of movement. l denotes the distance of the unmanned device from the destination, v denotes the speed of the unmanned device after executing the decision, and r3 is determined according to the acceleration acc of the unmanned device after executing the decision and the rate of change of the steering wheel angle when the unmanned device executes the decision.
In one or more embodiments of the present disclosure, when the server controls the unmanned device to execute the decision, if the unmanned device does not collide, r1 = dp + ttc; if the unmanned device collides, r1 = -w, where w is a positive number that may be set as desired, for example, 100.
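Putting the pieces together, a hedged sketch of the reward r = r1 + r2 + r3. The exact form of the smoothness term r3 is not given in the text, so r3 = -(|acc| + |steering rate|) is an assumption, chosen only to be consistent with both quantities being inversely related to the reward:

```python
def reward(l, v, acc, steer_rate, d_p=0.0, ttc=0.0, collided=False, w=100.0):
    """Sketch of r = r1 + r2 + r3 as described above.

    l: distance to the destination; v: speed after executing the decision;
    d_p: distance to obstacles; ttc: time to collision; w: collision penalty.
    The form of r3 is an assumption, not the patent's formula.
    """
    r1 = -w if collided else d_p + ttc    # safety: gap + time-to-collision
    r2 = -l + v                           # efficiency: distance out, speed in
    r3 = -(abs(acc) + abs(steer_rate))    # smoothness (assumed form)
    return r1 + r2 + r3
```

With these signs, a collision dominates the total through the penalty -w, while smooth, fast progress toward the destination raises the reward.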
In one or more embodiments of the present description, an optimization objective may be determined according to the PPO (Proximal Policy Optimization) algorithm, and parameters of the decision model are optimized according to the optimization objective.
In one or more embodiments of the present description, the optimization objective of the decision model may be expressed as:
J = E_t[min(r_t(θ)A_t, clip(r_t(θ), 1-ε, 1+ε)A_t)]
wherein J is the optimization objective, θ is the policy parameter of the decision model to be optimized, E_t is the expected value at time t, ε is a hyperparameter that limits the policy update amplitude, clip is a clipping function used to clip the value of r_t(θ) into the range (1-ε, 1+ε), and r_t(θ) is the ratio of the policy at time t in the current iterative update of the decision model to the old policy at time t from the previous iterative update, namely
r_t(θ) = π(a_t|s_t) / π_old(a_t|s_t)
where π(a_t|s_t) represents the policy at the current iterative update and π_old(a_t|s_t) represents the old policy at the previous iterative update. A_t represents the advantage function calculated based on the reward function r, and
A_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T-t+1) δ_{T-1}
δ_t = r_t + γV(s_{t+1}) - V(s_t)
wherein γ is a preset discount factor, λ is a parameter preset according to the GAE (Generalized Advantage Estimation) algorithm, δ_t is the temporal-difference error at time t, T is the total duration of the collected motion trajectory of the unmanned device, and V(s_t) is the value function corresponding to time t.
In one or more embodiments of the present disclosure, ε may be 0.2 and γ may be 0.95.
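A small numerical sketch of the GAE advantage and the clipped PPO objective, using γ = 0.95 and ε = 0.2 from the text; λ = 0.95 is an assumed value:

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.95, lam=0.95):
    """GAE: delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), and A_t is the
    (gamma*lam)-discounted sum of the deltas from t to the trajectory end.
    `values` carries one extra entry for the final state."""
    T = len(rewards)
    deltas = rewards + gamma * values[1:] - values[:-1]
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):      # backward recursion over the trajectory
        running = deltas[t] + gamma * lam * running
        adv[t] = running
    return adv

def ppo_objective(ratio, adv, eps=0.2):
    """Clipped surrogate J = E[min(r_t(theta)*A_t, clip(r_t(theta), 1-eps, 1+eps)*A_t)]."""
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

The clip keeps a policy that has drifted far from the old one (here, any ratio outside [0.8, 1.2]) from collecting extra objective value, which is what limits the update amplitude.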
In one or more embodiments of the present disclosure, the optimization objective J of the PPO algorithm may be maximized using stochastic gradient descent (Stochastic Gradient Descent, SGD) to update the neural network weights θ, based on the motion trajectories of total duration T (i.e., T time steps) collected each time from N parallel training environments (N·T steps of data in total).
Based on the same thought, the present specification also provides a control device of the corresponding unmanned device, as shown in fig. 7.
Fig. 7 is a schematic diagram of a control device of an unmanned apparatus provided in the present specification, where the device includes:
an environmental characteristic determining module 400, configured to determine a current environmental characteristic according to current motion data of an unmanned device, motion data of surrounding obstacles, and a destination position of the unmanned device, where the motion data includes at least a position and a speed;
A decoupling module 401, configured to input the environmental feature into an encoder in a pre-trained self-encoder, decouple the environmental feature, and determine each decoupling feature corresponding to the environmental feature, where the decoupling feature is used to characterize an obstacle position distribution of each lane and a speed distribution of each lane obstacle and the unmanned device;
the control module 402 is configured to input each decoupling feature into a decision model obtained by reinforcement learning in advance, determine a decision corresponding to the environmental feature, and control the movement of the unmanned device according to the decision;
the apparatus further comprises:
the training module 403 is configured to: divide environmental data collected during the running of the acquisition device into a plurality of environmental segments according to a preset time interval; for each environmental segment, determine the environmental characteristics of the environmental segment according to the environmental data corresponding to the environmental segment, and take the environmental characteristics as a training sample; determine, according to the environmental characteristics of the environmental segment, the obstacle position distribution of each lane and the speed distribution of each lane's obstacles and the acquisition device, and take these as the label of the training sample; exchange, according to the labels of the training samples, the decoupling features of at least part of the training samples output by the encoder of the self-encoder, and input them to the decoder to obtain exchanged reconstruction features; and determine a loss at least according to the environmental characteristics of the training samples and the exchanged reconstruction features, so as to adjust the parameters of the self-encoder.
Optionally, the training module 403 is further configured to determine, according to an environmental characteristic of the environmental segment, a parallel position of two sides of the collecting device in an adjacent lane of the lane where the collecting device is located, determine, as a parallel position distribution, a distance between an obstacle in front of the collecting device and the collecting device in each lane according to the environmental characteristic, determine, according to a distance corresponding to each lane, a lane interval distribution, and determine, according to the parallel position distribution and the lane interval distribution, a barrier position distribution of each lane.
Optionally, the training module 403 is further configured to determine, for each lane, a front target obstacle according to an obstacle in front of the collection device in the lane, determine a front speed characteristic of the lane and the collection device according to a speed comparison relation between the front target obstacle and the collection device, determine a rear target obstacle according to an obstacle behind the collection device in the lane, determine a rear speed characteristic of the lane and the collection device according to a speed comparison relation between the rear target obstacle and the collection device, and determine a speed distribution of each lane obstacle and the collection device according to the front speed characteristic and the rear speed characteristic of each lane.
Optionally, the training module 403 is further configured to determine, for each training sample, each training sample that is at least partially identical to the training sample label, so as to construct a label association relationship between training samples, determine, according to the label association relationship, each training sample group, determine, for each training sample group, a partial label that is identical to each training sample in the training sample group, as a target label, use a decoupling feature that characterizes the target label as a target decoupling feature, and exchange target decoupling features of each training sample in the training sample group.
Optionally, the training module 403 is further configured to, for each training sample, input each decoupling feature of the training sample to a decoder of the self-encoder, and determine a reconstruction feature corresponding to the training sample.
Optionally, the training module 403 is further configured to determine, for each training sample, a reconstruction loss according to a difference between an environmental characteristic and a reconstruction characteristic of the training sample, where the reconstruction loss characterizes a difference between an input and an output of the self-encoder, and determine, according to a difference between an environmental characteristic and an exchange reconstruction characteristic of the training sample, an exchange loss characterizing a difference between results of decoupling of an encoder of the self-encoder from the same label, and determine, according to a reconstruction loss and an exchange loss of at least part of the training samples, a total loss.
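A minimal sketch of the combined reconstruction and exchange losses; `encode`, `decode`, and `target_slice` are hypothetical stand-ins, and the identity maps in the example exist only to make the expected losses easy to check:

```python
import numpy as np

def autoencoder_losses(x_a, x_b, encode, decode, target_slice):
    """Reconstruction loss plus exchange loss for one pair of samples.

    target_slice indexes the decoupled feature for which the two samples
    share the same label; if the encoder has truly decoupled it, swapping
    that slice should leave both reconstructions unchanged.
    """
    z_a, z_b = encode(x_a), encode(x_b)
    # Reconstruction loss: difference between input and output of the self-encoder.
    recon = np.mean((decode(z_a) - x_a) ** 2) + np.mean((decode(z_b) - x_b) ** 2)
    # Swap the shared-label decoupled feature, then decode again.
    z_a_sw, z_b_sw = z_a.copy(), z_b.copy()
    z_a_sw[target_slice] = z_b[target_slice].copy()
    z_b_sw[target_slice] = z_a[target_slice].copy()
    exchange = (np.mean((decode(z_a_sw) - x_a) ** 2)
                + np.mean((decode(z_b_sw) - x_b) ** 2))
    return recon + exchange  # total loss to minimize

# Identity encode/decode: swapping an identical (shared-label) slice costs nothing.
x_a = np.array([1.0, 2.0, 3.0, 4.0])
x_b = np.array([1.0, 2.0, 9.0, 9.0])
loss_shared = autoencoder_losses(x_a, x_b, lambda x: x.copy(), lambda z: z, slice(0, 2))
```

Swapping a slice where the two samples differ would instead inflate the exchange term, which is exactly the signal that pushes the encoder toward label-consistent decoupling.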
The apparatus further comprises: and the adjustment module 404 is configured to redetermine the motion data of the unmanned device and surrounding obstacles to determine a distance and a relative speed between the unmanned device and the obstacles, determine a reward corresponding to the decision according to the determined distance and relative speed, and adjust parameters of the decision model with the maximum reward as an optimization target.
The adjustment module 404 is further configured to determine whether the unmanned device collides, if so, determine a penalty corresponding to the decision, stop a training process of the decision model currently, and redetermine environmental features to continue training the decision model, and if not, redetermine motion data of the unmanned device and surrounding obstacles to determine a distance and a relative speed between the unmanned device and the obstacles.
Optionally, the adjusting module 404 is further configured to re-determine a distance to be travelled between the unmanned device and the destination location, determine a collision time between the unmanned device and each obstacle according to the re-determined distance between the unmanned device and each obstacle and a relative speed, and determine a reward corresponding to the decision according to the collision time, the speed of the unmanned device and the distance to be travelled, where the distance to be travelled is inversely related to the reward, the collision time is positively related to the reward, the speed of the unmanned device is positively related to the reward, and the distance between the unmanned device and each obstacle is positively related to the reward.
Optionally, the adjusting module 404 is further configured to determine, when the unmanned device is controlled according to the decision, the steering wheel angle change rate of the unmanned device and the acceleration of the unmanned device, redetermine the distance to be travelled between the unmanned device and the destination location, and determine the reward corresponding to the decision according to the redetermined speed of the unmanned device, the steering wheel angle change rate, the acceleration, the redetermined distance to be travelled, the distance between the unmanned device and each obstacle, and the relative speed between the unmanned device and each obstacle, wherein the steering wheel angle change rate is inversely related to the reward, the acceleration is inversely related to the reward, the redetermined distance to be travelled is inversely related to the reward, the redetermined speed of the unmanned device is positively related to the reward, the distance between the unmanned device and each obstacle is positively related to the reward, and the relative speed between the unmanned device and each obstacle is inversely related to the reward.
The present specification also provides a computer-readable storage medium storing a computer program operable to execute the control method of the above-described unmanned apparatus.
The present specification also provides a schematic structural diagram of the electronic device shown in fig. 8. At the hardware level, as shown in fig. 8, the electronic device includes a processor, an internal bus, a memory, and a nonvolatile memory, and may of course include hardware required by other services. The processor reads the corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to realize the control method of the unmanned equipment.
Of course, other implementations, such as logic devices or combinations of hardware and software, are not excluded from the present description, that is, the execution subject of the following processing flows is not limited to each logic unit, but may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements of method flows today can be regarded as direct improvements of hardware circuit structures. Designers almost always obtain a corresponding hardware circuit structure by programming an improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement of a method flow cannot be realized by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD) (e.g., a field programmable gate array (Field Programmable Gate Array, FPGA)) is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs to "integrate" a digital system onto a PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, nowadays, instead of manually manufacturing integrated circuit chips, such programming is mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development and writing; the original code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logic method flow can be readily obtained by merely slightly programming the method flow into an integrated circuit using several of the hardware description languages described above.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor and a computer readable medium storing computer readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller in pure computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps such that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may thus be regarded as a hardware component, and the means included therein for performing various functions may also be regarded as structures within the hardware component. Or even the means for performing various functions may be regarded as both software modules implementing the method and structures within the hardware component.
The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being functionally divided into various units, respectively. Of course, the functions of each element may be implemented in one or more software and/or hardware elements when implemented in the present specification.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random Access Memory (RAM) and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the present specification may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present description can take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, as relevant to see a section of the description of method embodiments.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the disclosure. Various modifications and alterations to this specification will become apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present description, are intended to be included within the scope of the claims of the present description.

Claims (12)

1. A control method of unmanned equipment, characterized by comprising:
determining current environmental characteristics according to current motion data of the unmanned equipment, motion data of surrounding obstacles, and a destination position of the unmanned equipment, wherein the motion data at least comprises a position and a speed;
inputting the environmental characteristics into an encoder in a pre-trained self-encoder, decoupling the environmental characteristics, and determining decoupling characteristics corresponding to the environmental characteristics, wherein the decoupling characteristics are used for representing obstacle position distribution of each lane and speed distribution of each lane obstacle and the unmanned equipment;
inputting each decoupling characteristic into a decision model obtained in advance through reinforcement learning, determining a decision corresponding to the environmental characteristics, and controlling the unmanned equipment to move according to the decision;
the self-encoder is trained by the following method:
dividing environmental data acquired in the running process of the acquisition equipment into a plurality of environmental segments according to a preset time interval;
for each environmental segment, determining the environmental characteristics of the environmental segment according to the environmental data corresponding to the environmental segment, and taking the environmental characteristics as a training sample;
determining, according to the environmental characteristics of the environmental segment, the obstacle position distribution of each lane and the speed distribution of each lane obstacle and the acquisition equipment, and taking the obstacle position distribution and the speed distributions as labels of the training sample;
exchanging, according to the labels of the training samples, the decoupling characteristics of at least part of the training samples output by the encoder of the self-encoder, and inputting the exchanged decoupling characteristics into a decoder to obtain exchanged reconstruction characteristics;
determining a loss based at least on the environmental characteristics of the training samples and the exchanged reconstruction characteristics, so as to adjust the parameters of the self-encoder.
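The swap-then-reconstruct training recited in claim 1 can be sketched as follows. Everything in this sketch is an assumption for exposition (the linear encoder and decoder, the dimensions `D`, `K`, `S`, and the helper names); the patent does not specify a network architecture, and this is not the patented implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: each environment feature has D dims; the encoder
# splits it into K decoupled slots of size S (e.g. one slot per lane's
# obstacle-position distribution or speed distribution).
D, K, S = 8, 4, 2
W_enc = rng.normal(size=(K * S, D)) * 0.1  # linear encoder (a stand-in)
W_dec = rng.normal(size=(D, K * S)) * 0.1  # linear decoder (a stand-in)

def encode(x):
    """Map an environment feature vector to K decoupled feature slots."""
    return (W_enc @ x).reshape(K, S)

def decode(z):
    """Reconstruct an environment feature vector from the slots."""
    return W_dec @ z.reshape(-1)

def swap_and_reconstruct(x_a, x_b, slot):
    """Exchange one decoupled slot between two samples (whose labels are
    assumed to agree on that slot), then decode the exchanged features."""
    z_a, z_b = encode(x_a), encode(x_b)
    z_a[slot], z_b[slot] = z_b[slot].copy(), z_a[slot].copy()
    return decode(z_a), decode(z_b)

x_a, x_b = rng.normal(size=D), rng.normal(size=D)
r_a, r_b = swap_and_reconstruct(x_a, x_b, slot=1)
# Exchange loss: distance between each input and its reconstruction
# obtained after swapping a shared-label slot (MSE is an assumed choice).
exchange_loss = np.mean((x_a - r_a) ** 2) + np.mean((x_b - r_b) ** 2)
```

Because the swap only exchanges features whose labels agree, a well-trained encoder should make the exchanged reconstruction close to the original input, which is what the loss penalizes.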
2. The method of claim 1, wherein determining the obstacle position distribution of each lane according to the environmental characteristics of the environmental segment specifically comprises:
according to the environmental characteristics of the environmental segment, determining parallel positions on two sides of the acquisition equipment in the lanes adjacent to the lane where the acquisition equipment is located, and determining the obstacle distribution at the parallel positions as a parallel position distribution;
according to the environmental characteristics, determining the interval between the obstacle in front of the acquisition equipment and the acquisition equipment in each lane, and determining a lane interval distribution according to the interval corresponding to each lane;
and determining the obstacle position distribution of each lane according to the parallel position distribution and the lane interval distribution.
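The position features of claim 2 can be illustrated with the sketch below, which derives a parallel-position occupancy flag and a front interval per lane. The window size, the empty-lane default, and the data layout are assumed values for exposition, not taken from the patent:

```python
# Hypothetical per-lane snapshot: obstacle longitudinal offsets in metres
# relative to the acquisition vehicle; positive means ahead of it.
lanes = {
    "left":    [-3.0, 25.0],
    "current": [18.0, 60.0],
    "right":   [40.0],
}

PARALLEL_WINDOW = 5.0  # assumed half-length of the "parallel" zone (m)

def parallel_occupancy(offsets):
    """1 if some obstacle sits alongside the vehicle, else 0."""
    return int(any(abs(o) <= PARALLEL_WINDOW for o in offsets))

def front_gap(offsets, default=120.0):
    """Interval to the nearest obstacle ahead in the lane (a default
    value stands in for an empty lane)."""
    ahead = [o for o in offsets if o > 0]
    return min(ahead) if ahead else default

# Obstacle position distribution: (parallel flag, front interval) per lane.
position_distribution = {
    lane: (parallel_occupancy(obs), front_gap(obs))
    for lane, obs in lanes.items()
}
```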
3. The method of claim 1, wherein determining the speed distribution of each lane obstacle and the acquisition equipment specifically comprises:
for each lane, determining a front target obstacle according to an obstacle in front of the acquisition equipment in the lane, and determining a front speed characteristic of the lane and the acquisition equipment according to a speed comparison relation of the front target obstacle and the acquisition equipment;
determining a rear target obstacle according to an obstacle behind the acquisition equipment in the lane, and determining rear speed characteristics of the lane and the acquisition equipment according to a speed comparison relation between the rear target obstacle and the acquisition equipment;
and determining the speed distribution of each lane obstacle and the acquisition equipment according to the front speed characteristic and the rear speed characteristic of each lane.
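The speed-comparison relation of claim 3 can be sketched as a ternary feature per target obstacle. The tolerance and the example speeds below are assumptions; the claim only requires a comparison relation, not this particular encoding:

```python
def speed_feature(ego_speed, obstacle_speed, tol=0.5):
    """Ternary speed-comparison feature: 1 if the target obstacle is
    faster than the vehicle, -1 if slower, 0 if similar. The tolerance
    (m/s) is an assumed value."""
    if obstacle_speed > ego_speed + tol:
        return 1
    if obstacle_speed < ego_speed - tol:
        return -1
    return 0

ego = 10.0  # hypothetical vehicle speed, m/s
# (front target speed, rear target speed) per lane, hypothetical values.
lane_targets = {"left": (12.0, 8.0), "current": (10.2, 14.0)}
speed_distribution = {
    lane: (speed_feature(ego, front), speed_feature(ego, rear))
    for lane, (front, rear) in lane_targets.items()
}
```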
4. The method of claim 1, wherein exchanging, according to the labels of the training samples, the decoupling characteristics of at least part of the training samples output by the encoder of the self-encoder specifically comprises:
for each training sample, determining the training samples whose labels are at least partially identical to the label of that training sample, so as to construct a label association relationship among the training samples;
determining each training sample group according to the label association relation, and determining partial labels with the same training samples in each training sample group as target labels;
and taking the decoupling characteristic representing the target label as a target decoupling characteristic, and exchanging the target decoupling characteristic of each training sample in the training sample group.
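The label-association step of claim 4 can be illustrated as follows. The label components and sample names are hypothetical; only the pairing logic (samples agreeing on at least one label component, with the shared component as the target label) follows the claim:

```python
# Hypothetical labels: one component per decoupled feature slot.
labels = {
    "s0": ("gap_small", "faster"),
    "s1": ("gap_small", "slower"),
    "s2": ("gap_large", "slower"),
}

def shared_components(a, b):
    """Indices of the label components on which two samples agree."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]

# Build the label association relation: pair samples whose labels agree
# on at least one component; the shared components are the target labels
# whose decoupling characteristics the pair would exchange.
names = sorted(labels)
pairs = []
for i, a in enumerate(names):
    for b in names[i + 1:]:
        shared = shared_components(labels[a], labels[b])
        if shared:
            pairs.append((a, b, shared))
```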
5. The method of claim 4, wherein prior to exchanging decoupling characteristics of at least a portion of training samples output from an encoder of the self-encoder, the method further comprises:
for each training sample, each decoupling characteristic of the training sample is input into a decoder of the self-encoder, and a reconstruction characteristic corresponding to the training sample is determined.
6. The method of claim 5, wherein determining the loss based at least on the environmental characteristics of the training samples and the exchanged reconstruction characteristics specifically comprises:
for each training sample, determining a reconstruction loss according to the difference between the environmental characteristics and the reconstruction characteristics of the training sample, wherein the reconstruction loss characterizes the difference between the input and the output of the self-encoder;
determining an exchange loss according to the difference between the environmental characteristics and the exchanged reconstruction characteristics of the training sample, wherein the exchange loss characterizes the difference between the results obtained when the encoder of the self-encoder decouples the same label;
and determining a total loss based on the reconstruction loss and the exchange loss of at least part of the training samples.
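A minimal illustration of the combined loss in claim 6. Mean-squared error and the exchange weight are assumed choices; the patent does not fix the distance measure:

```python
def total_loss(x, recon, swap_recon, w_exchange=1.0):
    """Reconstruction loss plus exchange loss. Mean-squared error and
    the exchange weight are illustrative choices, not from the patent."""
    mse = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)
    return mse(x, recon) + w_exchange * mse(x, swap_recon)

x = [1.0, 2.0, 3.0]  # a toy environment feature vector
loss = total_loss(x, recon=[1.1, 2.0, 2.9], swap_recon=[1.0, 2.2, 3.1])
```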
7. The method of claim 1, wherein after controlling the movement of the unmanned device according to the decision, the method further comprises:
re-determining the motion data of the unmanned equipment and the surrounding obstacles to determine the distance and the relative speed between the unmanned equipment and each obstacle;
determining a reward corresponding to the decision according to the determined distance and relative speed, and adjusting parameters of the decision model with maximization of the reward as an optimization target.
8. The method of claim 7, wherein re-determining the motion data of the unmanned equipment and the surrounding obstacles to determine the distance and the relative speed between the unmanned equipment and each obstacle specifically comprises:
judging whether the unmanned equipment collides or not;
if yes, determining a penalty corresponding to the decision, stopping the current training process of the decision model, and re-determining environmental characteristics to continue training the decision model;
if not, the movement data of the unmanned device and surrounding obstacles are re-determined to determine the distance and the relative speed between the unmanned device and the obstacles.
9. The method of claim 7, wherein determining the reward corresponding to the decision according to the determined distance and relative speed specifically comprises:
re-determining the distance to be driven between the unmanned equipment and the destination position;
determining collision time of the unmanned equipment and each obstacle according to the re-determined distance and relative speed between the unmanned equipment and each obstacle;
determining rewards corresponding to the decisions according to the collision time, the speed of the unmanned equipment and the distance to be driven;
wherein the distance to be driven is inversely related to the reward, the collision time is positively related to the reward, the speed of the unmanned equipment is positively related to the reward, and the distance between the unmanned equipment and each obstacle is positively related to the reward.
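The monotonic relations recited in claim 9 can be illustrated with a simple shaped reward. The weights, the time-to-collision cap, and the linear form are assumed values; the claim only fixes the directions of the relations:

```python
def time_to_collision(gap, closing_speed, cap=10.0):
    """Collision time from the re-determined gap and closing speed;
    capped (assumed value) when the vehicles are not closing."""
    return min(cap, gap / closing_speed) if closing_speed > 0 else cap

def reward(ttc, ego_speed, dist_to_go, min_obstacle_dist,
           w=(1.0, 0.1, 0.01, 0.1)):
    """Shaped reward matching the recited relations: positively related
    to collision time, speed, and obstacle distance; negatively related
    to the distance still to be driven. Weights are assumed."""
    w_ttc, w_v, w_d, w_clear = w
    return (w_ttc * ttc + w_v * ego_speed
            + w_clear * min_obstacle_dist - w_d * dist_to_go)
```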
10. The method of claim 7, wherein determining the reward corresponding to the decision according to the determined distance and relative speed specifically comprises:
determining a steering wheel angle change rate of the unmanned equipment and an acceleration of the unmanned equipment when the unmanned equipment is controlled according to the decision, and re-determining the distance to be driven between the unmanned equipment and the destination position;
determining the reward corresponding to the decision according to the re-determined speed of the unmanned equipment, the steering wheel angle change rate, the acceleration, the re-determined distance to be driven, the distance between the unmanned equipment and each obstacle, and the relative speed between the unmanned equipment and each obstacle;
wherein the steering wheel angle change rate is inversely related to the reward, the acceleration is inversely related to the reward, the re-determined distance to be driven is inversely related to the reward, the re-determined speed of the unmanned equipment is positively related to the reward, the distance between the unmanned equipment and each obstacle is positively related to the reward, and the relative speed between the unmanned equipment and each obstacle is inversely related to the reward.
11. A computer readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-10.
12. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method of any of the preceding claims 1-10 when executing the program.
CN202111315547.7A 2021-11-08 2021-11-08 Control method and device of unmanned equipment Active CN114167857B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111315547.7A CN114167857B (en) 2021-11-08 2021-11-08 Control method and device of unmanned equipment

Publications (2)

Publication Number Publication Date
CN114167857A CN114167857A (en) 2022-03-11
CN114167857B true CN114167857B (en) 2023-12-22

Family

ID=80478594

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111315547.7A Active CN114167857B (en) 2021-11-08 2021-11-08 Control method and device of unmanned equipment

Country Status (1)

Country Link
CN (1) CN114167857B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115439957B (en) * 2022-09-14 2023-12-08 上汽大众汽车有限公司 Intelligent driving data acquisition method, acquisition device, acquisition equipment and computer readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007316799A (en) * 2006-05-24 2007-12-06 Tottori Univ Autonomous mobile robot having learning function
CN101804625A (en) * 2009-02-18 2010-08-18 索尼公司 Robot device and control method thereof and computer program
CN102393744A (en) * 2011-11-22 2012-03-28 湖南大学 Navigation method of pilotless automobile
CN103085816A (en) * 2013-01-30 2013-05-08 同济大学 Trajectory tracking control method and control device for driverless vehicle
CN103207090A (en) * 2013-04-09 2013-07-17 北京理工大学 Driverless vehicle environment simulation test system and test method
CN111123927A (en) * 2019-12-20 2020-05-08 北京三快在线科技有限公司 Trajectory planning method and device, automatic driving equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013161033A1 (en) * 2012-04-26 2013-10-31 株式会社日立製作所 Autonomous movement device, autonomous movement system, and autonomous movement method


Also Published As

Publication number Publication date
CN114167857A (en) 2022-03-11

Similar Documents

Publication Publication Date Title
CN112677995B (en) Vehicle track planning method and device, storage medium and equipment
CN111208838B (en) Control method and device of unmanned equipment
CN110262486B (en) Unmanned equipment motion control method and device
CN112306059B (en) Training method, control method and device for control model
CN111007858B (en) Training method of vehicle driving decision model, driving decision determining method and device
CN111238523B (en) Method and device for predicting motion trail
CN111338360B (en) Method and device for planning vehicle driving state
CN113296541B (en) Future collision risk based unmanned equipment control method and device
CN113341941B (en) Control method and device of unmanned equipment
CN113110526B (en) Model training method, unmanned equipment control method and device
CN114167857B (en) Control method and device of unmanned equipment
CN112947495B (en) Model training method, unmanned equipment control method and device
CN116295473A (en) Unmanned vehicle path planning method, unmanned vehicle path planning device, unmanned vehicle path planning equipment and storage medium
CN113074748B (en) Path planning method and device for unmanned equipment
CN111123957B (en) Method and device for planning track
CN114019971B (en) Unmanned equipment control method and device, storage medium and electronic equipment
CN110895406B (en) Method and device for testing unmanned equipment based on interferent track planning
CN114153207B (en) Control method and control device of unmanned equipment
CN112987754B (en) Unmanned equipment control method and device, storage medium and electronic equipment
CN114815825A (en) Method and device for determining optimal driving track of vehicle
CN114426030B (en) Pedestrian passing intention estimation method, device, equipment and automobile
CN113033527A (en) Scene recognition method and device, storage medium and unmanned equipment
CN114019981B (en) Track planning method and device for unmanned equipment
CN112949756B (en) Method and device for model training and trajectory planning
CN112572473B (en) Control method and device of unmanned equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant