CN115047864A - Model training method, unmanned equipment control method and device


Info

Publication number
CN115047864A
Authority
CN
China
Prior art keywords
reward value
data
training
state data
determining
Prior art date
Legal status
Pending
Application number
CN202210161211.8A
Other languages
Chinese (zh)
Inventor
熊方舟
李伟
丁曙光
张羽
周奕达
樊明宇
任冬淳
Current Assignee
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202210161211.8A
Publication of CN115047864A
Legal status: Pending

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

The present specification discloses a model training method, an unmanned device control method, and an apparatus. Historical state data is acquired and then input, in time series, into a pre-trained long short-term memory network to predict the state data of each obstacle after a set historical time, as predicted state data. The historical state data and the predicted state data are then input into a decision model to be trained, so that data of interest is determined, through the weights of an attention mechanism network, from the environment data of the environment in which the designated device is located at the set historical time. Finally, the reward value corresponding to the designated device after driving according to the control parameters at the set historical time is determined, and the decision model is trained. The method can determine the data of interest from the environment data of the environment in which the designated device is located at the set historical time and can measure, through the reward value, how reasonable the determined control parameters are, thereby effectively ensuring safe driving of the designated device.

Description

Model training method, unmanned equipment control method and device
Technical Field
The present disclosure relates to the field of unmanned driving technologies, and in particular, to a model training method, an unmanned device control method, and an apparatus.
Background
Currently, a decision model is usually trained with real human driving data, a Long Short-Term Memory network (LSTM) is added to the decision model, and the control parameters of a designated device at a future time are predicted from the driving decisions, i.e., the control parameters, determined by the designated device at each historical time. This approach depends on expert experience, requires a large amount of sample data, and has weak migration and generalization capabilities. Moreover, although the long short-term memory network has a memory function, the historical driving decisions of the designated device are not what matters most in a reinforcement learning decision process. As a result, in practical applications the control parameters predicted for the designated device at a future time by this approach have a low success rate in avoiding obstacles, and the device may collide with other surrounding obstacles, so the safety is low.
Therefore, how to plan a reasonable driving track by the unmanned equipment according to the interaction situation of surrounding traffic participants is an urgent problem to be solved.
Disclosure of Invention
The present specification provides a model training method, an unmanned device control method, and an apparatus, which partially solve the above problems in the prior art.
The technical scheme adopted by the specification is as follows:
the present specification provides a method of model training, comprising:
acquiring historical state data corresponding to the designated equipment and each obstacle at each historical moment;
inputting the historical state data into a pre-trained long-short term memory network according to a time sequence to predict state data of each obstacle after a set historical time, as predicted state data;
inputting the historical state data and the predicted state data into a decision model to be trained, and determining interesting data from environment data of an environment where the designated equipment is located at the set historical moment through weights corresponding to an attention mechanism network;
determining a control parameter corresponding to the designated equipment at the set historical moment according to the interested data;
and determining a reward value corresponding to the designated equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
Optionally, the decision model comprises: evaluating the submodels;
determining a reward value corresponding to the designated device after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, wherein the reward value comprises:
inputting the interested data and the control parameters into the evaluation submodel, predicting a reward value corresponding to the designated equipment after driving according to the control parameters at the set historical moment as a reward value to be optimized, and determining an actual reward value corresponding to the designated equipment after driving according to the control parameters at the set historical moment according to the interested data and the control parameters;
training the decision model according to the reward value, including:
and taking the reward value to be optimized to approach the actual reward value as an optimization target, and training the decision model.
Optionally, the decision model comprises: a target evaluation submodel and a target strategy submodel;
determining an actual reward value corresponding to the designated device after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, wherein the actual reward value comprises:
determining the reward value of the designated equipment under the environment at the set historical moment according to the interested data and the control parameter, and taking the reward value as the reward value corresponding to the set historical moment;
through the attention mechanism network, predicting the interested data of the specified equipment after driving according to the control parameters at the set historical time as predicted interested data, inputting the predicted interested data into the target strategy submodel, and determining the control parameters of the specified equipment after the set historical time as predicted control parameters;
inputting the prediction interest data and the prediction control parameters into the target evaluation submodel, and determining a prediction reward value of the designated equipment after the set historical moment;
and determining the actual reward value according to the reward value corresponding to the set historical moment and the predicted reward value.
Optionally, the target evaluation submodel includes: a first target evaluation submodel and a second target evaluation submodel;
inputting the prediction interest data and the prediction control parameters into the target evaluation submodel, and determining a predicted reward value of the designated device after the set historical time, wherein the method comprises the following steps:
inputting the predicted interest data and the prediction control parameter into the first target evaluation submodel, determining a predicted reward value of the designated device after the set historical time as a first candidate reward value, and inputting the predicted interest data and the prediction control parameter into the second target evaluation submodel, determining a predicted reward value of the designated device after the set historical time as a second candidate reward value;
and using the smaller value of the first candidate reward value and the second candidate reward value as the predicted reward value.
Optionally, determining, according to the data of interest and the control parameter, a reward value of the specific device in an environment where the specific device is located at the set historical time includes:
determining a first influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the specified device under the environment of the set historical time according to the first influence factor, wherein the first influence factor is used for representing the time difference between the time when the specified device reaches the specified point and the time when each obstacle reaches the specified point, and the greater the time difference is, the greater the reward value of the specified device under the environment of the set historical time is.
Optionally, determining, according to the data of interest and the control parameter, a reward value of the specific device in an environment where the specific device is located at the set historical time includes:
determining a second influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the designated device in the environment at the set history moment according to the second influence factor, wherein the second influence factor is used for representing the passing efficiency of the designated device when the designated device runs according to the control parameter, and the greater the passing efficiency is, the greater the reward value of the designated device in the environment at the set history moment is.
Optionally, determining, according to the data of interest and the control parameter, an incentive value of the specific device in an environment where the specific device is located at the set historical time includes:
determining a third influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the designated device under the environment at the set history moment according to the third influence factor, wherein the third influence factor is used for representing the state change degree of the designated device after driving according to the control parameter, and the greater the state change degree is, the smaller the reward value of the designated device under the environment at the set history moment is.
Optionally, the evaluation submodel includes: a first evaluation submodel and a second evaluation submodel;
inputting the interested data and the control parameters into the evaluation submodel, predicting a reward value corresponding to the designated device after driving according to the control parameters at the set historical moment, and taking the reward value as a reward value to be optimized, wherein the method specifically comprises the following steps:
inputting the interested data and the control parameters into the first evaluation submodel, and predicting a corresponding reward value of the designated equipment after driving according to the control parameters at the set historical moment to be used as a first reward value to be optimized;
inputting the interested data and the control parameters into the second evaluation submodel, and predicting a corresponding reward value of the designated equipment after driving according to the control parameters at the set historical moment to be used as a second reward value to be optimized;
taking the reward value to be optimized to approach the actual reward value as an optimization target, training the decision model, specifically comprising:
and training a first evaluation submodel in the decision model by taking the first reward value to be optimized to approach the actual reward value as an optimization target, and training a second evaluation submodel in the decision model by taking the second reward value to be optimized to approach the actual reward value as an optimization target.
Optionally, the decision model comprises: a strategy sub-model and a target strategy sub-model;
taking the reward value to be optimized to approach the actual reward value as an optimization target, training the decision model, specifically comprising:
aiming at each round of training, taking the first reward value to be optimized to approach the actual reward value as an optimization target, updating model parameters of the first evaluation sub-model in the round of training based on a first parameter updating step length, and taking the second reward value to be optimized to approach the actual reward value as an optimization target, updating model parameters of the second evaluation sub-model in the round of training based on the first parameter updating step length;
updating the model parameters of the strategy submodel in the round of training according to the model parameters of the first evaluation submodel in the round of training, the model parameters of the second evaluation submodel in the round of training and the second parameter updating step length;
and updating the model parameters of the target strategy submodel in the round of training according to the model parameters and the soft updating coefficient of the strategy submodel in the round of training until preset conditions are met so as to finish the training of the target strategy submodel in the decision model.
The present specification provides a control method of unmanned equipment, including:
acquiring state data corresponding to the unmanned equipment and each obstacle at the current moment as current state data;
inputting the current state data into a pre-trained long-short term memory network to predict the state data of each obstacle after the current moment as predicted state data;
inputting the current state data and the predicted state data into a trained decision model, and determining control parameters corresponding to the unmanned equipment at the current moment, wherein the decision model is obtained by training through the model training method;
and controlling the unmanned equipment according to the control parameters corresponding to the unmanned equipment at the current moment.
The present specification provides an apparatus for model training, comprising:
the acquisition module is used for acquiring historical state data corresponding to the designated equipment and each obstacle at each historical moment;
the prediction module is used for inputting the historical state data into a pre-trained long-short term memory network according to a time sequence so as to predict the state data of each obstacle after a set historical time, as predicted state data;
the input module is used for inputting the historical state data and the prediction state data into a decision model to be trained so as to determine interesting data from the environmental data of the environment of the designated equipment at the set historical moment through the weight corresponding to the attention mechanism network;
the determining module is used for determining the control parameters corresponding to the designated equipment at the set historical moment according to the interested data;
and the training module is used for determining a reward value corresponding to the designated equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
This specification provides a control apparatus of unmanned equipment, including:
the acquisition module is used for acquiring the state data of the unmanned equipment and the obstacles at the current moment as the current state data;
the prediction module is used for inputting the current state data into a pre-trained long-short term memory network so as to predict the state data of each obstacle after the current moment as predicted state data;
the determining module is used for inputting the current state data and the predicted state data into a trained decision model and determining control parameters corresponding to the unmanned equipment at the current moment, wherein the decision model is obtained by training through the model training method;
and the control module is used for controlling the unmanned equipment according to the control parameters corresponding to the unmanned equipment at the current moment.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of model training or method of controlling an unmanned device.
The present specification provides an unmanned device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the above method of model training or the method of controlling an unmanned device when executing the program.
The technical scheme adopted by the specification can achieve the following beneficial effects:
in the model training method provided in the present specification, historical state data corresponding to a specific device and each obstacle at each historical time is acquired. Next, the historical state data is input into a long-short term memory network trained in advance in time series to predict the state data of each obstacle after the historical time is set as predicted state data. And then inputting the historical state data and the prediction state data into a decision model to be trained so as to determine interesting data from the environmental data of the environment where the designated equipment is located at the set historical moment through the weight corresponding to the attention mechanism network. Then, according to the interested data, the control parameter corresponding to the designated device at the set historical moment is determined. And finally, determining a reward value corresponding to the appointed equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
According to the model training method, historical state data can be input into a pre-trained long-short term memory network according to a time sequence, and prediction state data corresponding to each obstacle are determined. And determining interesting data from the environmental data of the environment where the designated equipment is located at the set historical moment through the weight corresponding to the attention mechanism network in the decision model. And then, according to the interested data and the control parameters, determining a corresponding reward value of the designated equipment after driving according to the control parameters at the set historical moment, and measuring the reasonable degree of the determined control parameters. By training the decision model in such a way, the probability of collision between the designated equipment and surrounding obstacles can be reduced, and collision between the designated equipment and the surrounding obstacles is avoided, so that safe driving of the designated equipment is effectively guaranteed.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and are incorporated in and constitute a part of this specification, illustrate embodiments of the specification and, together with the description, serve to explain the specification, and are not intended to limit the specification. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating a method for model training provided in an embodiment of the present disclosure;
FIG. 2 is a diagram illustrating a model structure of a decision model provided in an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a control method of an unmanned device provided in an embodiment of the present specification;
FIG. 4 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a control device of an unmanned device provided in an embodiment of the present specification;
FIG. 6 is a schematic structural diagram of an unmanned device provided in an embodiment of the present specification.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort belong to the protection scope of the present specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
In the embodiment of the present specification, before determining the control parameter according to the current state data corresponding to the specified device, a decision model trained in advance needs to be relied on, and a process of how to train the decision model will be described first, as shown in fig. 1.
Fig. 1 is a schematic flow chart of a method for training a model provided in an embodiment of the present specification, which specifically includes the following steps:
s100: and acquiring historical state data corresponding to the designated equipment and each obstacle at each historical moment.
The execution subject of the training method of the decision model provided in this specification may be a designated device, or may be an electronic device such as a server or a desktop computer.
In the embodiment of the present specification, the server may acquire historical state data corresponding to the designated device and each obstacle at each historical time. The designated device may be equipped with various sensors, such as a camera, a laser radar, a millimeter-wave radar, a satellite positioning system, etc., for sensing the environment around the designated device during driving and acquiring the required state data. The obstacle mentioned here may refer to a movable object around the designated device during its movement, such as a surrounding vehicle, bicycle, or pedestrian, that is, an object that can interfere with the movement of the designated device; it may also refer to a stationary object around the designated device during its movement, such as a surrounding tree or building.
The designated device mentioned here may refer to a device dedicated to data acquisition, such as a human-driven automobile, a human-operated robot, and the like, or may refer to an unmanned device.
In this embodiment of the present specification, the acquired state data may include: position data specifying the device, and position data specifying obstacles around the device, velocity data specifying obstacles around the device, steering angle data specifying the device, and the like.
Of course, the server may determine, according to the acquired state data, relative position data between the designated device and each obstacle around the designated device, relative speed data between the designated device and each obstacle around the designated device, and the like.
Further, during the movement of the designated device, a plurality of obstacles may exist around the designated device, and therefore, the designated device may collect and acquire status data of the obstacles for each obstacle around the designated device.
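For illustration only, the following Python sketch shows one possible way to organize the state data described above; the field names (ego_position, ego_velocity, obstacles, etc.) are illustrative assumptions and are not defined in this specification.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ObstacleState:
    # State of one obstacle at one historical time step.
    position: Tuple[float, float]   # (x, y) in a map frame
    velocity: Tuple[float, float]   # (vx, vy)

@dataclass
class StateSample:
    # State data of the designated device and all surrounding obstacles at one time step.
    timestamp: float
    ego_position: Tuple[float, float]
    ego_velocity: Tuple[float, float]
    ego_steering_angle: float
    obstacles: List[ObstacleState]

    def relative_positions(self) -> List[Tuple[float, float]]:
        # Relative position of each obstacle with respect to the designated device,
        # which the server may derive from the acquired state data.
        ex, ey = self.ego_position
        return [(o.position[0] - ex, o.position[1] - ey) for o in self.obstacles]
```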
The unmanned device referred to in this specification may refer to an unmanned vehicle, a robot, an automatic distribution device, or the like capable of realizing automatic driving. Based on this, the unmanned device to which the model training method provided by the present specification is applied can be used for executing distribution tasks in the distribution field, such as business scenarios for distribution of express delivery, logistics, takeaway, and the like by using the unmanned device.
S102: and inputting the historical state data into a long-short term memory network trained in advance according to a time sequence to predict the state data of each obstacle after the historical time is set as predicted state data.
In practical applications, the control parameters corresponding to the designated device at a future time are usually predicted through a Long Short-Term Memory network (LSTM) according to the driving decisions, i.e., the control parameters, determined by the designated device at each historical time. However, in the model training process under the reinforcement learning framework, the interaction between the designated device and each obstacle of the surrounding environment is more important for determining the control parameter corresponding to the future time than the control parameter determined by the designated device at each historical time. Based on the data, the server can predict the state data of each obstacle at the future time according to the specified equipment and the state data corresponding to each obstacle at each historical time through the long-short term memory network.
In the embodiment of the present specification, the server may input the historical state data into a long-short term memory network trained in advance in time series to predict the state data of each obstacle after the set historical time as predicted state data.
Specifically, the server may sort the historical state data in a time sequence, select the historical state data based on a preset time window, input the historical state data into a pre-trained long-short term memory network, and predict the state data of each obstacle after the preset historical time. The set history time mentioned here may refer to the current time. For example, the server may input historical state data within five seconds before the current time into a long-short term memory network trained in advance, and predict state data of each obstacle within one second in the future.
Further, for each obstacle, the historical speed data corresponding to the obstacle, sorted in time order, may be expressed as vel_i = (v_t1, …, v_tn), where vel_i may represent the speed data sequence of the i-th obstacle and v_tn may represent the speed data corresponding to the i-th obstacle at time tn. The historical speed data corresponding to all obstacles, sorted in time order, may be expressed as Vel = (vel_1, …, vel_n), where n is the number of obstacles.
The server may then input the historical speed data corresponding to each obstacle into the pre-trained long-short term memory network in time series, and output the speed data of each obstacle at a future time. The speed data of each obstacle at the future time may be expressed as H = (v_p1, …, v_pn), where v_pn may represent the speed data of the n-th obstacle at the future time.
In the model training process of the long-short term memory network, the set historical time mentioned here may be the time with the latest time series in the historical state data selected according to a preset time window.
In the embodiment of the present specification, there may be a plurality of training methods for the long-short term memory network, for example, historical state data of the obstacles sorted in time sequence is acquired, state data of the obstacle after the historical time is set is predicted based on a preset time window, and the long-short term memory network is trained by minimizing a deviation between the predicted state data of the obstacle after the historical time and real state data of the obstacle after the historical time is set. The present specification does not limit the specific form of the training method of the long-short term memory network.
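As a non-limiting illustration of the training described above, the following Python sketch (using PyTorch) predicts obstacle speeds from a time-ordered window of historical speeds and minimizes the deviation between the predicted and the real future speeds; the layer sizes, window length, batch size, and optimizer settings are assumptions, not values given in this specification.

```python
import torch
import torch.nn as nn

class ObstacleSpeedLSTM(nn.Module):
    # Predicts an obstacle's speed after the set historical time from a
    # time-ordered window of its historical speeds.
    def __init__(self, input_size=2, hidden_size=64, output_size=2):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, output_size)

    def forward(self, speed_window):
        # speed_window: (batch, window_len, input_size), e.g. a few seconds of past speeds.
        _, (h_n, _) = self.lstm(speed_window)
        return self.head(h_n[-1])   # predicted speed after the window

# Train by minimizing the deviation between the predicted state data and the
# real state data of the obstacle after the set historical time.
model = ObstacleSpeedLSTM()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

past = torch.randn(32, 50, 2)    # placeholder batch: 32 obstacles, 50 past steps
future = torch.randn(32, 2)      # placeholder ground-truth future speeds
loss = loss_fn(model(past), future)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```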
S104: and inputting the historical state data and the predicted state data into a decision model to be trained so as to determine interesting data from the environmental data of the environment where the designated equipment is located at the set historical moment through weights corresponding to the attention mechanism network.
In practical applications, the historical state data corresponding to all obstacles around the designated device is generally input into the decision model; however, not all obstacles will affect the subsequent driving decision of the designated device. For example, in the course of changing lanes, the designated device determines its control parameters for a future period of time mainly in consideration of the state data of the obstacles on the current lane and on the target lane it intends to change into. That is, the server is more concerned with the state data that can affect the lane change of the designated device, while obstacles on other lanes have little or no effect on the control parameters of the designated device over a future period of time.
Therefore, the server needs to determine which state data in the historical state data corresponding to each obstacle has a larger influence on the control parameter of the designated device in a future period of time, and ignore the state data having a smaller influence on the control parameter of the designated device in the future period of time.
In this embodiment, the server may input the historical state data and the predicted state data into a decision model to be trained, so as to determine, by using weights corresponding to the attention mechanism network, data of interest from environment data of an environment in which a given device is located at a set historical time. The interesting data mentioned here can be used to represent data in which the state data corresponding to each obstacle has a large influence on the control parameter determined by the designated device. The specific formula for determining the corresponding weight of the attention mechanism network is as follows:
e_t = tanh(W[CurVel, CurPos, v_pt] + b)
In the above formula, e_t represents the data of interest determined for the t-th obstacle after passing through the attention mechanism network. tanh() may be used to represent an activation function, so that the output value lies between -1 and 1. W[·] may be used to characterize the weights corresponding to the attention mechanism network, i.e., the network parameters of the attention mechanism network. CurVel may be used to characterize the historical speed data corresponding to the designated device and each obstacle. CurPos may be used to characterize the historical position data corresponding to the designated device and each obstacle. v_pt may be used to characterize the predicted state data corresponding to each obstacle at the future time. b may be used to characterize the bias.
As can be seen from the above formula, for each obstacle, the server may adjust the weight corresponding to the attention mechanism network through training of the decision model, and determine state data with a large influence degree on the control parameter of the specified device in a future period of time in the historical state data corresponding to the obstacle, so that the specified device determines a more accurate control parameter.
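The formula above may, for illustration, be realized as follows in Python (PyTorch); the feature dimensions and the way the scores are used to weight the obstacle features are assumptions, not requirements of this specification.

```python
import torch
import torch.nn as nn

class InterestAttention(nn.Module):
    # One way to realize e_t = tanh(W[CurVel, CurPos, v_pt] + b) per obstacle.
    def __init__(self, vel_dim=2, pos_dim=2, pred_dim=2):
        super().__init__()
        # The linear layer plays the role of W and b in the formula above.
        self.proj = nn.Linear(vel_dim + pos_dim + pred_dim, 1)

    def forward(self, cur_vel, cur_pos, pred_state):
        # Each argument: (num_obstacles, dim); one row per obstacle.
        features = torch.cat([cur_vel, cur_pos, pred_state], dim=-1)
        e = torch.tanh(self.proj(features))      # interest score per obstacle
        # Weight each obstacle's features by its score to obtain the data of interest.
        return e * features, e
```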
S106: and determining the control parameters corresponding to the designated equipment at the set historical moment according to the interested data.
In this embodiment, the server may determine, according to the data of interest, a control parameter corresponding to the designated device at the set history time.
Specifically, the server may use the speed change amount and the steering angle change amount of the designated device as the control parameters. The specific formula is as follows:
a = (Δv, Δφ)
In the above formula, Δv may be used to characterize the speed change amount of the designated device, and Δφ may be used to characterize the steering angle change amount of the designated device. Each change amount is determined by the speed or steering angle of the designated device at the current time and the speed or steering angle corresponding to the designated device at the previous time. For Δv, a value greater than 0 may be used to characterize an increase in the current speed of the device compared with the previous time, and a value less than 0 may be used to characterize a decrease in the current speed of the device compared with the previous time. For Δφ, a value greater than 0 may be used to characterize the angle by which the device turns left at the current time compared with the previous time, and a value less than 0 may be used to characterize the angle by which the device turns right at the current time compared with the previous time.
It should be noted that the speed variation and the steering angle variation of the specific device can be optimized through training and updating the decision model, so as to better control the specific device to drive.
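For illustration, the sign conventions described above may be applied as in the following Python sketch; the clamping of the speed to [0, v_max] is an assumption and not part of this specification.

```python
def apply_control(speed, steering_angle, delta_v, delta_phi, v_max=15.0):
    # A positive delta_v raises the speed and a negative one lowers it; a positive
    # delta_phi steers left and a negative one steers right. v_max is an example value.
    new_speed = max(0.0, min(speed + delta_v, v_max))
    new_angle = steering_angle + delta_phi
    return new_speed, new_angle
```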
S108: and determining a reward value corresponding to the appointed equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
In this embodiment, the server may determine, according to the data of interest and the control parameters, the reward value corresponding to the designated device after driving according to the control parameters at the set historical time, and train the decision model according to the reward value.
In an embodiment of the present specification, the decision model includes an evaluation submodel. The server may input the data of interest and the control parameters into the evaluation submodel, predict the reward value corresponding to the designated device after driving according to the control parameters at the set historical time as the reward value to be optimized, and, according to the data of interest and the control parameters, determine the actual reward value corresponding to the designated device after driving according to the control parameters at the set historical time.
Then, the reward value to be optimized is close to the actual reward value as an optimization target, and the decision model is trained.
Further, the decision model comprises: a policy submodel. The server may input the data of interest and the control parameters to the evaluation submodel to determine a predicted reward value for the given device at a set historical time.
Secondly, the server can predict the interested data of the specified device after driving according to the control parameters at the set historical time through the attention mechanism network, the predicted interested data is used as the predicted interested data, the predicted interested data is input into the strategy sub-model, and the control parameters of the specified device after the set historical time are determined and used as the predicted control parameters.
The server may then enter the predicted interest data and the predictive control parameters into the evaluation submodel to determine the predicted reward value for the given device after setting the historical time.
Finally, the server can determine the reward value to be optimized according to the predicted reward value corresponding to the set historical time and the predicted reward value after the set historical time. The specific formula is as follows:
Q_w(s_j, a_j)
In the above formula, s_j can be used to characterize the data of interest corresponding to the designated device at the set historical time j, a_j can be used to characterize the control parameter corresponding to the designated device at the set historical time j, and w can be used to characterize the model parameters of the evaluation submodel. Q_w(s_j, a_j) can be used to characterize the reward value to be optimized for the driving trajectory obtained when the designated device drives according to the control parameter at the set historical time j. Here, Q_w(s_j, a_j) refers to a state-action value function characterizing the expectation of the cumulative reward value of the designated device over the entire driving trajectory from the set historical time j to the completion of driving. The formula does not explicitly show all states s_j, …, s_{j+t} and all actions a_j, …, a_{j+t} after the set historical time j; however, because Q_w(s_j, a_j) is an expectation of the cumulative reward value over a future period of time, all states s_j, …, s_{j+t} and all actions a_j, …, a_{j+t} after the set historical time j are implied.
That is, the reward value to be optimized is essentially the expectation of the cumulative reward value over the entire driving trajectory from the set historical time j to the completion of driving of the designated device.
In the embodiments of the present specification, the evaluation submodel includes: a first evaluation submodel and a second evaluation submodel. The server can input the interested data and the control parameters into the first evaluation submodel, and predict the corresponding reward value of the specified device after driving according to the control parameters at the set historical moment, wherein the reward value is used as the first reward value to be optimized. The concrete formula is as follows:
Q_{w_{1,now}}(s_j, a_j)
In the above formula, w_{1,now} may be used to characterize the current model parameters of the first evaluation submodel. Q_{w_{1,now}}(s_j, a_j) can be used to characterize, based on the first evaluation submodel, the first reward value to be optimized after the designated device drives according to the control parameters at the set historical time j. Here, Q_{w_{1,now}}(s_j, a_j) refers to a state-action value function characterizing, based on the first evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory from the set historical time j to the completion of driving of the designated device. That is, the first reward value to be optimized is essentially, based on the first evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory of the designated device from the set historical time j to the completion of driving.
The server can input the interested data and the control parameters into the second evaluation submodel, and predict the corresponding reward value of the specified device after driving according to the control parameters at the set historical moment, wherein the reward value is used as a second reward value to be optimized. The concrete formula is as follows:
Q_{w_{2,now}}(s_j, a_j)
In the above formula, w_{2,now} may be used to characterize the current model parameters of the second evaluation submodel. Q_{w_{2,now}}(s_j, a_j) can be used to characterize, based on the second evaluation submodel, the second reward value to be optimized after the designated device drives according to the control parameters at the set historical time j. Here, Q_{w_{2,now}}(s_j, a_j) refers to a state-action value function characterizing, based on the second evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory from the set historical time j to the completion of driving of the designated device. That is, the second reward value to be optimized is essentially, based on the second evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory of the designated device from the set historical time j to the completion of driving.
Further, the server may train the first evaluation submodel in the decision model with the first reward value to be optimized approaching the actual reward value as the optimization target, and train the second evaluation submodel in the decision model with the second reward value to be optimized approaching the actual reward value as the optimization target.
In an embodiment of the present specification, the decision model includes: a target evaluation submodel and a target strategy submodel. The server can determine the reward value of the designated device under the environment at the set historical time according to the interested data and the control parameter, and the reward value is used as the reward value corresponding to the set historical time.
The server needs to ensure that the designated device does not collide with surrounding obstacles while driving according to the control parameters output by the decision model, and it also needs to ensure the driving efficiency and the stability of the device when driving according to these control parameters. Based on these three aspects, the server can determine the reward value of the designated device in the environment at the set historical time.
Specifically, the server may determine a first influence factor corresponding to the set historical time according to the data of interest and the control parameters. The server may then determine, according to the first influence factor, the reward value of the designated device in the environment at the set historical time. The first influence factor mentioned here may be used to characterize the time difference between the time when the designated device reaches a specified point and the time when each obstacle reaches the specified point; the greater the time difference, the greater the reward value of the designated device in the environment at the set historical time. The specified point mentioned here may be used to characterize an overlap point between the travel path of the designated device and the travel path of each obstacle, and may also be used to characterize a location point at which the designated device and the obstacles are likely to arrive in the future. Specifically, the following formula can be referred to:
r_safe = ttc_1
In the above formula, r_safe may be used to represent the collision-related reward value obtained when the designated device drives according to the control parameters at the set historical time, and ttc_1 may represent the time difference between the time when the designated device reaches the specified point and the time when each obstacle reaches the specified point. ttc_1 may take various specific forms, for example, the average time difference between the designated device and the obstacles while the device drives according to the control parameters at the set historical time.
Of course, ttc_1 may also be a set value determined by the time difference between the time when the designated device reaches the specified point and the time when each obstacle reaches the specified point: if the time difference is smaller than a set threshold, ttc_1 takes a smaller set value, and if the time difference is not smaller than the set threshold, ttc_1 takes a larger set value.
Note that ttc_1 described above can be understood as the first influence factor. Regardless of which form ttc_1 takes, the above formula can be applied such that, if it is determined that the designated device collides with an obstacle, a preset maximum value is subtracted from the r_safe reward value, so that when the decision model trained in this way makes decisions, collisions between the designated device and obstacles can be effectively avoided.
In this embodiment, the server may determine the second influence factor corresponding to the set historical time according to the data of interest and the control parameter. Then, the server may determine the reward value of the designated device in the environment at the set history time according to a second influence factor, where the second influence factor mentioned here may be used to represent the traffic efficiency of the designated device when the designated device travels according to the control parameter, and the greater the traffic efficiency, the greater the reward value of the designated device in the environment at the set history time. Specifically, the following formula can be referred to:
r_pass = (v + Δv) / v_max
In the above formula, r_pass may be used to represent the traffic-efficiency-related reward value obtained when the designated device drives according to the control parameters at the set historical time. v may be used to characterize the travel speed of the designated device at the set historical time, (v + Δv) may be used to characterize the travel speed of the designated device when driving according to the control parameters, and v_max is the maximum driving speed of the designated device in the road scene. As can be seen from the above formula, the greater the speed at which the designated device travels according to the control parameters at the set historical time, the greater the reward value in terms of traffic efficiency.
In this embodiment, the server may determine the third influence factor corresponding to the set historical time according to the data of interest and the control parameters. The server may then determine, according to the third influence factor, the reward value of the designated device in the environment at the set historical time. The third influence factor mentioned here may be used to characterize the degree of state change of the designated device after driving according to the control parameters; the greater the degree of state change, the smaller the reward value of the designated device in the environment at the set historical time. Specifically, a formula of the following form can be referred to:
r_soft = −(|Δv| + |Δφ|)
In the above formula, r_soft may be used to represent the reward value, in terms of the degree of state change, obtained when the designated device drives according to the control parameters at the set historical time, |Δv| may be used to represent the rate of change of the speed of the designated device when driving according to the control parameters at the set historical time, and |Δφ| may be used to represent the rate of change of the steering angle of the designated device when driving according to the control parameters at the set historical time. As can be seen from this formula, a greater rate of change of the speed indicates worse stability when the designated device drives according to the control parameters at the set historical time, so r_soft is smaller; likewise, a greater rate of change of the steering angle indicates worse stability, so r_soft is smaller.
Of course, |Δv| may instead be used to represent the rate of change of the acceleration when the designated device drives according to the control parameters at the set historical time, which may be determined according to service requirements.
Further, the server may combine the reward values determined in the above manners, namely r_safe, r_pass, and r_soft, to determine the reward value corresponding to the designated device at the set historical time.
It should be noted that, there may be various specific forms of the reward function used by the server for training the decision model, as long as the relationship between the reward value and the time difference is represented positively, the relationship between the reward value and the traffic efficiency is represented positively, and the relationship between the reward value and the state change degree is represented negatively, and the specific form of the reward function is not limited in this specification.
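For illustration only, the following Python sketch combines the three reward terms discussed above into a single step reward; the additive combination, the weights, and the collision penalty are assumptions, since this specification only fixes the direction of each relationship.

```python
def step_reward(ttc, v, delta_v, delta_phi, v_max, collided,
                w_safe=1.0, w_pass=1.0, w_soft=1.0, collision_penalty=100.0):
    # Combine the collision, traffic-efficiency, and state-change reward terms.
    r_safe = ttc                               # larger time difference -> larger reward
    if collided:
        r_safe -= collision_penalty            # subtract a large preset value on collision
    r_pass = (v + delta_v) / v_max             # higher speed -> better traffic efficiency
    r_soft = -(abs(delta_v) + abs(delta_phi))  # larger state change -> smaller reward
    return w_safe * r_safe + w_pass * r_pass + w_soft * r_soft
```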
Secondly, the server can predict the interested data of the specified device after driving according to the control parameters at the set historical time through the attention mechanism network, the predicted interested data is used as the predicted interested data, the predicted interested data is input into the target strategy sub-model, and the control parameters of the specified device after the set historical time are determined and used as the predicted control parameters. Specifically, the following formula can be referred to:
ã_{j+1} = π_{θ'_{now}}(s̃_{j+1}) + ξ
In the above formula, ã_{j+1} may be used to characterize the predicted control parameter of the designated device after the set historical time, determined by the target strategy submodel π. θ'_{now} can be used to characterize the current model parameters of the target strategy submodel, and s̃_{j+1} denotes the predicted interest data. ξ can be used to characterize noise, and ξ can be independently and randomly drawn from a truncated normal distribution.
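For illustration, the prediction of the control parameter by the target strategy submodel with truncated noise may be sketched in Python (PyTorch) as follows; the noise scale, the truncation bound, and the action clamping range are assumed values not given in this specification.

```python
import torch

def target_action(target_policy, pred_interest, noise_std=0.2, noise_clip=0.5,
                  action_low=-1.0, action_high=1.0):
    # Predicted control parameter from the target strategy submodel plus noise
    # drawn from a truncated (clipped) normal distribution.
    with torch.no_grad():
        a = target_policy(pred_interest)
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        return (a + noise).clamp(action_low, action_high)
```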
The server may then enter the predicted interest data and the predictive control parameters into the target evaluation submodel, determining the predicted reward value for the given device after setting the historical time.
Finally, the server can determine the actual reward value according to the reward value corresponding to the set historical moment and the predicted reward value.
In practical application, when the decision model updates the model parameters each time, the model parameters of the decision model are updated through the currently determined control parameters with the largest predicted reward value, so that overestimation of the predicted reward value can occur. Based on the method, the server can determine different prediction reward values by using two sub models, and the model parameters are updated by selecting smaller prediction reward values to inhibit continuous overestimation.
In the embodiment of the present specification, the target evaluation submodel includes: a first target evaluation submodel and a second target evaluation submodel. The server may input the predicted interest data and the predicted control parameter into the first target evaluation submodel, and determine the predicted reward value of the designated device after the set historical time as a first candidate reward value. The following formula can be specifically referred to:
Q_{w'_{1,now}}(s̃_{j+1}, ã_{j+1})
In the above formula, w'_{1,now} can be used to characterize the current model parameters of the first target evaluation submodel. Q_{w'_{1,now}}(s̃_{j+1}, ã_{j+1}) can be used to characterize, based on the first target evaluation submodel, the first candidate reward value corresponding to the designated device after driving according to the control parameters at the set historical time j+1. Here, Q_{w'_{1,now}} refers to a state-action value function characterizing, based on the first target evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory of the designated device from the set historical time j+1 to the completion of driving. That is, the first candidate reward value is essentially, based on the first target evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory from the set historical time j+1 to the completion of driving.
The server may input the predicted interest data and the prediction control parameter into a second target evaluation submodel, and determine the predicted reward value for the given device after setting the historical time as a second candidate reward value. Specifically, the following formula can be referred to:
Q_{w'_{2,now}}(s̃_{j+1}, ã_{j+1})
In the above formula, w'_{2,now} can be used to characterize the current model parameters of the second target evaluation submodel. Q_{w'_{2,now}}(s̃_{j+1}, ã_{j+1}) can be used to characterize, based on the second target evaluation submodel, the second candidate reward value corresponding to the designated device after driving according to the control parameters at the set historical time j+1. Here, Q_{w'_{2,now}} refers to a state-action value function characterizing, based on the second target evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory of the designated device from the set historical time j+1 to the completion of driving. That is, the second candidate reward value is essentially, based on the second target evaluation submodel, the expectation of the cumulative reward value of the entire driving trajectory from the set historical time j+1 to the completion of driving.
Finally, the server may use the smaller of the first candidate reward value and the second candidate reward value as the predicted reward value, and combine it with the reward value corresponding to the set historical time to obtain the actual reward value. Specifically, the following formula can be referred to:

y_j = r_j + γ · min( Q_{w'_1}(s_{j+1}, a_{j+1}), Q_{w'_2}(s_{j+1}, a_{j+1}) )

In the above formula, y_j characterizes the actual reward value corresponding to the specified device after driving according to the control parameter at the set historical time. min(·,·) denotes selecting the smaller of the first candidate reward value and the second candidate reward value. γ is a discount factor used to reduce the influence of predicted reward values predicted at times after the set historical time j. r_j represents the reward value of the environment in which the specified device is located at the set historical time.
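As an illustrative aid only, the computation of the actual reward value described above can be sketched in code. This is a minimal sketch assuming PyTorch; the names target_critic_1 and target_critic_2 (callables mapping a state and an action to a scalar reward value estimate) are hypothetical and are not part of this specification.

```python
import torch

def compute_actual_reward(r_j, s_next, a_next, target_critic_1, target_critic_2, gamma=0.99):
    """Clipped double-Q style target: the reward value at the set historical time
    plus the discounted smaller of the two candidate reward values."""
    with torch.no_grad():
        q1 = target_critic_1(s_next, a_next)   # first candidate reward value
        q2 = target_critic_2(s_next, a_next)   # second candidate reward value
        predicted_reward = torch.min(q1, q2)   # keep the smaller value to curb overestimation
        y_j = r_j + gamma * predicted_reward   # actual reward value y_j
    return y_j
```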
Further, the evaluation submodel includes a first evaluation submodel and a second evaluation submodel. The server can input the data of interest and the control parameter into the first evaluation submodel, and predict the reward value corresponding to the specified device after driving according to the control parameter at the set historical time, as the first reward value to be optimized. Specifically, the following formula can be referred to:

δ_{1,j} = y_j − Q_{w_1}(s_j, a_j)

In the above formula, y_j characterizes the actual reward value of the specified device after driving according to the control parameter at the set historical time. Q_{w_1}(s_j, a_j) characterizes the first reward value to be optimized corresponding to the specified device after driving according to the control parameter at the set historical time j, based on the first evaluation submodel, where w_1 denotes the current model parameters of the first evaluation submodel, s_j the data of interest, and a_j the control parameter. δ_{1,j} characterizes the difference between the actual reward value and the first reward value to be optimized.
Similarly, the server may input the data of interest and the control parameter into the second evaluation submodel, and predict the reward value corresponding to the specified device after driving according to the control parameter at the set historical time, as the second reward value to be optimized. Specifically, the following formula can be referred to:

δ_{2,j} = y_j − Q_{w_2}(s_j, a_j)

In the above formula, y_j characterizes the actual reward value corresponding to the specified device after driving according to the control parameter at the set historical time. Q_{w_2}(s_j, a_j) characterizes the second reward value to be optimized corresponding to the specified device after driving according to the control parameter at the set historical time j, based on the second evaluation submodel, where w_2 denotes the current model parameters of the second evaluation submodel. δ_{2,j} characterizes the difference between the actual reward value and the second reward value to be optimized.
Then, the server can train the first evaluation submodel in the decision model by taking the first reward value to be optimized approaching the actual reward value as the optimization target.
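A corresponding sketch of the two to-be-optimized reward values and their differences from the actual reward value is given below, again assuming PyTorch; critic_1 and critic_2 stand for the first and second evaluation submodels and are hypothetical names.

```python
import torch
import torch.nn.functional as F

def critic_errors(critic_1, critic_2, s_j, a_j, y_j):
    """delta_{1,j} and delta_{2,j}: differences between the actual reward value y_j
    and each evaluation submodel's to-be-optimized reward value."""
    q1 = critic_1(s_j, a_j)          # first reward value to be optimized
    q2 = critic_2(s_j, a_j)          # second reward value to be optimized
    delta_1 = y_j - q1
    delta_2 = y_j - q2
    loss_1 = F.mse_loss(q1, y_j)     # squared difference used as the optimization target
    loss_2 = F.mse_loss(q2, y_j)
    return delta_1, delta_2, loss_1, loss_2
```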
In this embodiment, the server may complete training of the decision model by updating model parameters of each sub-model in the model structure of the decision model. As shown in particular in fig. 2.
Fig. 2 is a schematic diagram of a model structure of a decision model provided in an embodiment of the present specification.
In fig. 2, the server may input the data of interest into the strategy submodel and determine the predicted control parameters output by the strategy submodel. Secondly, the server can input the predicted control parameters output by the strategy submodel into the first evaluation submodel to determine the first reward value to be optimized, and input them into the second evaluation submodel to determine the second reward value to be optimized.
Likewise, the server may input the predicted data of interest into the target strategy submodel and determine the predicted control parameters output by the target strategy submodel. The server may input the predicted control parameters output by the target strategy submodel into the first target evaluation submodel to determine the first candidate reward value, and into the second target evaluation submodel to determine the second candidate reward value. The smaller of the first candidate reward value and the second candidate reward value is then used to determine the actual reward value.
Finally, the server may use the difference between the first reward value to be optimized and the actual reward value as the first difference, and the difference between the second reward value to be optimized and the actual reward value as the second difference. The server may update the first evaluation submodel based on the first difference and the second evaluation submodel based on the second difference.
Specifically, for each round of training, the server may take the first reward value to be optimized approaching the actual reward value as the optimization target and, based on the first parameter update step length, update the model parameters of the first evaluation submodel in that round of training. Specifically, the following formula can be referred to:
w_{1,new} = w_{1,now} − α · ∇_{w_1} (δ_{1,j})²

In the above formula, w_{1,now} characterizes the model parameters of the first evaluation submodel in the current round of training, and w_{1,new} characterizes the model parameters of the first evaluation submodel after this round of updating. α characterizes the first parameter update step length. The last term is the gradient, with respect to w_1, of the squared difference between the actual reward value and the first reward value to be optimized; subtracting it performs a gradient-descent update that drives the first reward value to be optimized toward the actual reward value.
Similarly, the server may train the second evaluation submodel in the decision model with the second reward value to be optimized approaching the actual reward value as the optimization target.
Specifically, the server may use the second reward value to be optimized to approach the actual reward value as the optimization target, and update the model parameters of the second evaluation sub-model in the round of training based on the first parameter update step length. Specifically, the following formula can be referred to:
w_{2,new} = w_{2,now} − α · ∇_{w_2} (δ_{2,j})²

In the above formula, w_{2,now} characterizes the model parameters of the second evaluation submodel in the current round of training, and w_{2,new} characterizes the model parameters of the second evaluation submodel after this round of updating. α characterizes the first parameter update step length and is set according to expert experience. The last term is the gradient, with respect to w_2, of the squared difference between the actual reward value and the second reward value to be optimized; subtracting it performs a gradient-descent update that drives the second reward value to be optimized toward the actual reward value.
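The gradient-descent update of the two evaluation submodels with the first parameter update step length can be sketched as follows. This assumes PyTorch and plain SGD; the value of alpha is a placeholder, not a value taken from this specification.

```python
import torch
import torch.nn.functional as F

def update_critics(critic_1, critic_2, s_j, a_j, y_j, alpha=1e-3):
    """One gradient-descent step per evaluation submodel toward the actual reward value."""
    for critic in (critic_1, critic_2):
        optimizer = torch.optim.SGD(critic.parameters(), lr=alpha)  # alpha: first update step length
        q = critic(s_j, a_j)                 # to-be-optimized reward value
        loss = F.mse_loss(q, y_j)            # squared difference to the actual reward value
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                     # w_new = w_now - alpha * gradient
```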
In this embodiment, the server may update the model parameters of the strategy submodel in this round of training according to the model parameters of the first evaluation submodel in this round of training, the model parameters of the second evaluation submodel in this round of training, and the second parameter update step length. The following formula can be specifically referred to:

θ_new = θ_now + β · ∇_a Q_w(s_j, a) |_{a = π_θ(s_j)} · ∇_θ π_θ(s_j)

In the above formula, θ_now characterizes the model parameters of the strategy submodel in the current round of training, and θ_new characterizes the model parameters of the strategy submodel after this round of updating. β characterizes the second parameter update step length and is set according to expert experience. The gradient term ∇_a Q_w(s_j, a)|_{a = π_θ(s_j)} · ∇_θ π_θ(s_j) characterizes the update of the strategy submodel's model parameters obtained by a gradient method based on the chain rule, where Q_w denotes the reward value estimate of the selected evaluation submodel. Here π_θ(s_j) = a_j characterizes the control parameter corresponding to the specified device at the set historical time, determined by the strategy submodel from the data of interest corresponding to the specified device at the set historical time. The server can randomly select one of the first evaluation submodel and the second evaluation submodel in this round of training, and use the selected submodel's reward value estimate to update the model parameters of the strategy submodel.
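The update of the strategy submodel can be sketched as follows, assuming PyTorch; randomly picking one of the two evaluation submodels and the name policy are illustrative assumptions rather than a definitive implementation.

```python
import random
import torch

def update_policy(policy, critic_1, critic_2, s_j, beta=1e-4):
    """One chain-rule gradient step on the strategy submodel: move pi_theta(s_j)
    toward control parameters with a larger estimated reward value."""
    critic = random.choice([critic_1, critic_2])               # randomly selected evaluation submodel
    optimizer = torch.optim.SGD(policy.parameters(), lr=beta)  # beta: second update step length
    a_j = policy(s_j)                                          # a_j = pi_theta(s_j)
    objective = -critic(s_j, a_j).mean()                       # negated so minimizing it raises the reward value
    optimizer.zero_grad()
    objective.backward()                                       # gradient flows through the critic into theta
    optimizer.step()                                           # only the strategy submodel's parameters are stepped
```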
Finally, the server can update the model parameters of the target strategy submodel in this round of training according to the model parameters of the strategy submodel in this round of training and the parameter soft update coefficient, until a preset condition is met, so as to complete the training of the target strategy submodel in the decision model. Specifically, the following formula can be referred to:
θ'_new = τ · θ_new + (1 − τ) · θ'_now

In the above formula, θ'_now characterizes the model parameters of the target strategy submodel in the current round of training. θ_new characterizes the model parameters of the strategy submodel after this round of updating. θ'_new characterizes the updated model parameters of the target strategy submodel. τ is the parameter soft update coefficient and is set according to expert experience.
Similarly, the model parameters of the first target evaluation submodel are soft-updated as follows:

w'_{1,new} = τ · w_{1,new} + (1 − τ) · w'_{1,now}

In the above formula, w'_{1,now} characterizes the model parameters of the first target evaluation submodel in the current round of training. w_{1,new} characterizes the model parameters of the first evaluation submodel after this round of updating. w'_{1,new} characterizes the updated model parameters of the first target evaluation submodel. τ is the parameter soft update coefficient and is set according to expert experience.
The model parameters of the second target evaluation submodel are soft-updated in the same way:

w'_{2,new} = τ · w_{2,new} + (1 − τ) · w'_{2,now}

In the above formula, w'_{2,now} characterizes the model parameters of the second target evaluation submodel in the current round of training. w_{2,new} characterizes the model parameters of the second evaluation submodel after this round of updating. w'_{2,new} characterizes the updated model parameters of the second target evaluation submodel. τ is the parameter soft update coefficient and is set according to expert experience.
As can be seen from the above formulas, only a portion of the target model parameters is moved toward the current model parameters in each update. Even though the target networks are updated at every iteration, a certain stability is maintained. The smaller the parameter soft update coefficient, the smaller the change in the target network parameters and the slower the convergence of the algorithm.
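The soft update described by the above formulas can be sketched as follows, assuming PyTorch modules; the value of tau is a placeholder.

```python
import torch

@torch.no_grad()
def soft_update(target_net, online_net, tau=0.005):
    """Each target parameter moves a fraction tau toward the corresponding online parameter."""
    for p_target, p_online in zip(target_net.parameters(), online_net.parameters()):
        p_target.mul_(1.0 - tau).add_(tau * p_online)

# Applied in each round to the target strategy submodel and both target evaluation submodels:
# soft_update(target_policy, policy, tau)
# soft_update(target_critic_1, critic_1, tau)
# soft_update(target_critic_2, critic_2, tau)
```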
It should be noted that the strategy submodel and the target strategy submodel have the same model structure; their model parameters may be determined according to business requirements and may be the same or different. Likewise, the first evaluation submodel and the second evaluation submodel have the same model structure, and the first target evaluation submodel and the second target evaluation submodel have the same model structure; in each case the model parameters may be determined according to business requirements and may be the same or different.
That is, the server may also update the model parameters of the strategy submodel in each round of training until a preset condition is met, so as to complete the training of the strategy submodel in the decision model. Whether the strategy submodel or the target strategy submodel is used in deployment can be selected according to business requirements. For example, if the target strategy submodel is not updated from the strategy submodel and the parameter soft update coefficient in the last round of training, the strategy submodel may be applied in practical applications. If the target strategy submodel is updated through the strategy submodel and the parameter soft update coefficient in the last round of training, the target strategy submodel may be applied in practical applications.
In the embodiment of the present specification, the preset condition may be that, after multiple rounds of iterative training of the decision model, the difference between the first reward value to be optimized and the actual reward value and the difference between the second reward value to be optimized and the actual reward value keep decreasing and converge within a value range, at which point the training process of the decision model is complete.
Of course, in addition to training the decision model with minimizing the difference as the optimization goal, the preset condition may also be that the decision model is trained by adjusting its model parameters with a preset difference as the optimization goal. That is, during multiple rounds of iterative training, the difference needs to keep approaching the preset difference; once, after multiple rounds of iterative training, the difference fluctuates around the preset difference, the training of the decision model can be determined to be complete.
In this process, the method inputs the historical state data into the pre-trained long-short term memory network and determines the predicted state data corresponding to each obstacle. The data of interest is then determined from the environmental data of the environment where the designated device is located at the set historical time, through the weights corresponding to the attention mechanism network in the decision model. Next, according to the data of interest and the control parameters, the reward value corresponding to the designated device after driving according to the control parameters at the set historical time is determined, which measures how reasonable the determined control parameters are. Training the decision model in this way reduces the probability of collision between the designated device and surrounding obstacles and helps avoid such collisions, thereby effectively guaranteeing safe driving of the designated device.
In the training process of the model, not only the time difference between the time when the designated device reaches a designated point and the time when each obstacle reaches that point is considered, but also the traffic efficiency of the designated device and the degree of change in its state, so that the control parameters obtained by the trained decision model from the state data can improve safety during driving as well as traffic efficiency and stability.
The execution subject of the model training method may also be a server, a computer, or other devices, that is, the server may obtain historical state data corresponding to the designated device and each obstacle at each historical time, input the historical state data into the decision model to be trained, determine a control parameter corresponding to the designated device at a set historical time, train the decision model based on the determined control parameter, and deploy the trained decision model to the designated device.
After the training of the decision model is completed, the embodiment of the present specification may deploy the trained decision model to the unmanned device to implement control over the unmanned device, as shown in fig. 3.
Fig. 3 is a schematic flow chart of a control method of an unmanned aerial vehicle provided in an embodiment of the present specification, which specifically includes:
s300: and acquiring state data corresponding to the unmanned equipment and each obstacle at the current moment as current state data.
S302: and inputting the current state data into a long-short term memory network trained in advance to predict the state data of each obstacle after the current time as predicted state data.
S304: and inputting the current state data and the predicted state data into a trained decision model, and determining control parameters corresponding to the unmanned equipment at the current moment, wherein the decision model is obtained by training through the model training method.
S306: and controlling the unmanned equipment according to the control parameters corresponding to the unmanned equipment at the current moment.
In this embodiment, the unmanned device may acquire, through various sensors provided on it (such as cameras and lidar), the state data corresponding to the unmanned device and each obstacle at the current time, as the current state data. Next, the unmanned device may input the current state data into the pre-trained long-short term memory network and predict the state data of each obstacle after the current time based on a preset time window, as the predicted state data corresponding to the current time. Then, the unmanned device may input the current state data and the predicted state data corresponding to the current time into the decision model, and determine the control parameters corresponding to the unmanned device at the current time. Finally, the unmanned device may be controlled according to the control parameters corresponding to it at the current time.
Specifically, the unmanned device may input the predicted state data corresponding to the current time into the strategy submodel in the decision model, and determine the control parameters corresponding to the unmanned device at the current time. Of course, if the target strategy submodel has been updated based on the strategy submodel and the parameter soft update coefficient, the unmanned device may also input the predicted state data corresponding to the current time into the target strategy submodel in the decision model to determine the control parameters corresponding to the current time. Whether the strategy submodel or the target strategy submodel is used to determine the control parameters corresponding to the current time can be selected according to business requirements.
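For illustration, one control cycle at inference time can be sketched as follows, assuming PyTorch; lstm_predictor and decision_model are hypothetical handles for the pre-trained long-short term memory network and the trained decision model, and their call signatures are assumptions.

```python
import torch

def control_step(lstm_predictor, decision_model, current_state):
    """Predict obstacle state data with the LSTM, then let the decision model
    output the control parameter for the current time."""
    with torch.no_grad():
        predicted_state = lstm_predictor(current_state)                    # predicted state data
        control_parameter = decision_model(current_state, predicted_state)
    return control_parameter
```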
The execution subject of the control method of the unmanned device provided by this specification may be the unmanned device itself, or a terminal device such as a server or a desktop computer. If a terminal device such as a server or a desktop computer serves as the execution subject, the terminal device can acquire the state data collected and uploaded by the unmanned device and, after determining the control parameters corresponding to the current time, return them to the unmanned device.
Based on the same idea, the present specification further provides a corresponding model training apparatus, as shown in fig. 4.
Fig. 4 is a schematic structural diagram of a model training device provided in an embodiment of the present specification, which specifically includes:
an obtaining module 400, configured to obtain historical state data corresponding to the designated device and each obstacle at each historical time;
a prediction module 402, configured to input the historical state data into a pre-trained long-short term memory network according to a time sequence, so as to predict state data of each obstacle after a set historical time as predicted state data;
an input module 404, configured to input the historical state data and the predicted state data into a decision model to be trained, so as to determine, through a weight corresponding to an attention mechanism network, data of interest from environment data of an environment where the specified device is located at the set historical time;
a determining module 406, configured to determine, according to the data of interest, a control parameter corresponding to the specified device at the set historical time;
the training module 408 is configured to determine, according to the interested data and the control parameter, an incentive value corresponding to the designated device after driving according to the control parameter at the set historical time, and train the decision model according to the incentive value.
Optionally, the decision model comprises: evaluating the submodels;
the training module 408 is specifically configured to input the data of interest and the control parameter into the evaluation submodel to predict the reward value corresponding to the designated device after driving according to the control parameter at the set historical time, as the reward value to be optimized; determine, according to the data of interest and the control parameter, the actual reward value corresponding to the designated device after driving according to the control parameter at the set historical time; and train the decision model by taking the reward value to be optimized approaching the actual reward value as the optimization target.
Optionally, the decision model comprises: a target evaluation submodel and a target strategy submodel;
the training module 408 is specifically configured to determine, according to the data of interest and the control parameter, the reward value of the designated device in the environment where it is located at the set historical time, as the reward value corresponding to the set historical time; predict, through the attention mechanism network, the data of interest of the designated device after driving according to the control parameter at the set historical time, as the predicted data of interest; input the predicted data of interest into the target strategy submodel to determine the control parameter of the designated device after the set historical time, as the predicted control parameter; input the predicted data of interest and the predicted control parameter into the target evaluation submodel to determine the predicted reward value of the designated device after the set historical time; and determine the actual reward value according to the reward value corresponding to the set historical time and the predicted reward value.
Optionally, the target evaluation submodel includes: a first target evaluation submodel and a second target evaluation submodel;
the training module 408 is specifically configured to input the predicted data of interest and the predicted control parameter into the first target evaluation submodel to determine the predicted reward value of the designated device after the set historical time, as the first candidate reward value; input the predicted data of interest and the predicted control parameter into the second target evaluation submodel to determine the predicted reward value of the designated device after the set historical time, as the second candidate reward value; and use the smaller of the first candidate reward value and the second candidate reward value as the predicted reward value.
Optionally, the training module 408 is specifically configured to determine, according to the data of interest and the control parameter, a first influence factor corresponding to the set historical time, and determine, according to the first influence factor, the reward value of the designated device in the environment where it is located at the set historical time, where the first influence factor is used to represent the time difference between the time when the designated device reaches a designated point and the time when each obstacle reaches that point; the greater the time difference, the greater the reward value of the designated device in the environment where it is located at the set historical time.
Optionally, the training module 408 is specifically configured to determine, according to the data of interest and the control parameter, a second influence factor corresponding to the set history time, and determine, according to the second influence factor, a reward value of the designated device in an environment where the designated device is located at the set history time, where the second influence factor is used to represent a traffic efficiency of the designated device when the designated device travels according to the control parameter, and the greater the traffic efficiency, the greater the reward value of the designated device in the environment where the designated device is located at the set history time.
Optionally, the training module 408 is specifically configured to determine, according to the data of interest and the control parameter, a third influence factor corresponding to the set historical time, and determine, according to the third influence factor, the reward value of the designated device in the environment where it is located at the set historical time, where the third influence factor is used to represent the degree of state change of the designated device after driving according to the control parameter; the greater the degree of state change, the smaller the reward value of the designated device in the environment where it is located at the set historical time.
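Taken together, the three influence factors can be combined into the environment reward value, for example as a weighted sum. The linear combination and the weights below are illustrative assumptions only; the specification defines the monotonic relationships, not this exact formula.

```python
def environment_reward(time_gap, traffic_efficiency, state_change, k1=1.0, k2=1.0, k3=1.0):
    """Illustrative reward value r_j: a larger time gap to obstacles and higher traffic
    efficiency increase it, a larger state change decreases it (weights are hypothetical)."""
    return k1 * time_gap + k2 * traffic_efficiency - k3 * state_change
```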
Optionally, the evaluation submodel includes: a first evaluation submodel and a second evaluation submodel;
the training module 408 is specifically configured to input the data of interest and the control parameter into the first evaluation submodel to predict the reward value corresponding to the designated device after driving according to the control parameter at the set historical time, as the first reward value to be optimized; input the data of interest and the control parameter into the second evaluation submodel to predict the reward value corresponding to the designated device after driving according to the control parameter at the set historical time, as the second reward value to be optimized; train the first evaluation submodel in the decision model by taking the first reward value to be optimized approaching the actual reward value as the optimization target; and train the second evaluation submodel in the decision model by taking the second reward value to be optimized approaching the actual reward value as the optimization target.
Optionally, the decision model comprises: a strategy sub-model and a target strategy sub-model;
the training module 408 is specifically configured to, for each round of training, update the model parameters of the first evaluation submodel in that round based on the first parameter update step length, taking the first reward value to be optimized approaching the actual reward value as the optimization target; update the model parameters of the second evaluation submodel in that round based on the first parameter update step length, taking the second reward value to be optimized approaching the actual reward value as the optimization target; update the model parameters of the strategy submodel in that round according to the model parameters of the first evaluation submodel in that round, the model parameters of the second evaluation submodel in that round, and the second parameter update step length; and update the model parameters of the target strategy submodel in that round according to the model parameters of the strategy submodel in that round and the soft update coefficient, until a preset condition is met, so as to complete the training of the target strategy submodel in the decision model.
Fig. 5 is a schematic structural diagram of a control device of an unmanned aerial vehicle provided in an embodiment of this specification, which specifically includes:
an obtaining module 500, configured to obtain state data corresponding to the unmanned device and each obstacle at the current time as current state data;
a prediction module 502, configured to input the current state data into a pre-trained long-term and short-term memory network, so as to predict state data of each obstacle after the current time as predicted state data;
a determining module 504, configured to input the current state data and the predicted state data into a trained decision model, and determine a control parameter corresponding to the unmanned device at the current time, where the decision model is obtained by training through the model training method;
and the control module 506 is configured to control the unmanned device according to a control parameter corresponding to the unmanned device at the current time.
The present specification also provides a computer-readable storage medium having stored thereon a computer program operable to execute the method of model training provided in fig. 1 above or the method of controlling an unmanned aerial device provided in fig. 3 above.
The present specification also provides a schematic structural diagram of the drone shown in fig. 6. As shown in fig. 6, the drone includes, at the hardware level, a processor, an internal bus, a network interface, a memory, and a non-volatile memory, although it may include hardware required for other services. The processor reads a corresponding computer program from the non-volatile memory into the memory and then runs the computer program to implement the model training method described in fig. 1 or the control method of the unmanned aerial vehicle provided in fig. 3. Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has developed, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized by hardware entity modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array, FPGA) is an integrated circuit whose logic functions are determined by the user programming the device. A designer "integrates" a digital system onto a single PLD by programming it, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to a software compiler used in program development, and the original code to be compiled must be written in a particular programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), among which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can easily be obtained merely by slightly logically programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (such as software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of such controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320; a memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art also know that, in addition to implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller implements the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can therefore be regarded as a hardware component, and the means included therein for implementing various functions can also be regarded as structures within the hardware component. Or even, the means for implementing various functions can be regarded both as software modules implementing the method and as structures within the hardware component.
The systems, apparatuses, modules or units described in the above embodiments may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement the information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (14)

1. A method of model training, comprising:
acquiring historical state data corresponding to the designated equipment and each barrier at each historical moment;
inputting the historical state data into a pre-trained long-short term memory network according to a time sequence to predict state data of each obstacle after a set historical moment as predicted state data;
inputting the historical state data and the predicted state data into a decision model to be trained, and determining interesting data from environment data of an environment where the designated equipment is located at the set historical moment through weights corresponding to an attention mechanism network;
determining a control parameter corresponding to the designated equipment at the set historical moment according to the interested data;
and determining a reward value corresponding to the designated equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
2. The method of claim 1, wherein the decision model comprises: evaluating the submodels;
determining a reward value corresponding to the designated device after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, wherein the reward value comprises:
inputting the interested data and the control parameters into the evaluation submodel, predicting a reward value corresponding to the appointed equipment after driving according to the control parameters at the set historical moment as a reward value to be optimized, and determining an actual reward value corresponding to the appointed equipment after driving according to the control parameters at the set historical moment according to the interested data and the control parameters;
training the decision model according to the reward value, including:
and training the decision model by taking the reward value to be optimized approaching the actual reward value as an optimization target.
3. The method of claim 2, wherein the decision model comprises: a target evaluation submodel and a target strategy submodel;
according to the interested data and the control parameters, determining the corresponding actual reward value of the designated equipment after driving according to the control parameters at the set historical moment, wherein the actual reward value comprises the following steps:
determining the reward value of the designated equipment under the environment at the set historical moment according to the interested data and the control parameters, and taking the reward value as the reward value corresponding to the set historical moment;
through the attention mechanism network, predicting the interested data of the specified equipment after driving according to the control parameters at the set historical time as predicted interested data, inputting the predicted interested data into the target strategy submodel, and determining the control parameters of the specified equipment after the set historical time as predicted control parameters;
inputting the prediction interest data and the prediction control parameters into the target evaluation submodel, and determining a prediction reward value of the designated equipment after the set historical moment;
and determining the actual reward value according to the reward value corresponding to the set historical moment and the predicted reward value.
4. The method of claim 3, wherein the target evaluation submodel comprises: a first target evaluation submodel and a second target evaluation submodel;
inputting the prediction interest data and the prediction control parameters into the target evaluation submodel, and determining a predicted reward value of the designated device after the set historical time, wherein the method comprises the following steps:
inputting the predicted interest data and the prediction control parameter into the first target evaluation submodel, determining a predicted reward value of the designated device after the set historical time as a first candidate reward value, and inputting the predicted interest data and the prediction control parameter into the second target evaluation submodel, determining a predicted reward value of the designated device after the set historical time as a second candidate reward value;
and using the smaller value of the first candidate reward value and the second candidate reward value as the predicted reward value.
5. The method of claim 3, wherein determining a reward value for the given device in the environment of the set historical time based on the data of interest and the control parameter comprises:
determining a first influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the specified device under the environment of the set historical time according to the first influence factor, wherein the first influence factor is used for representing the time difference between the time when the specified device reaches the specified point and the time when each obstacle reaches the specified point, and the greater the time difference is, the greater the reward value of the specified device under the environment of the set historical time is.
6. The method of claim 3, wherein determining the reward value of the given device for the environment in which the given device is located at the set historical time based on the data of interest and the control parameter comprises:
determining a second influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the designated equipment under the environment at the set history moment according to the second influence factor, wherein the second influence factor is used for representing the passing efficiency of the designated equipment when the designated equipment runs according to the control parameters, and the greater the passing efficiency is, the greater the reward value of the designated equipment under the environment at the set history moment is.
7. The method according to claim 3, wherein determining the reward value of the designated device in the environment of the set historical time according to the data of interest and the control parameter includes:
determining a third influence factor corresponding to the set historical moment according to the interested data and the control parameter;
and determining the reward value of the designated device under the environment at the set history moment according to the third influence factor, wherein the third influence factor is used for representing the state change degree of the designated device after driving according to the control parameter, and the greater the state change degree is, the smaller the reward value of the designated device under the environment at the set history moment is.
8. The method of claim 2, wherein the evaluating submodels comprises: a first evaluation submodel and a second evaluation submodel;
inputting the interested data and the control parameters into the evaluation submodel, predicting a reward value corresponding to the designated device after driving according to the control parameters at the set historical moment, and taking the reward value as a reward value to be optimized, wherein the method specifically comprises the following steps:
inputting the interested data and the control parameters into the first evaluation submodel, and predicting a corresponding reward value of the designated equipment after driving according to the control parameters at the set historical moment to be used as a first reward value to be optimized;
inputting the interested data and the control parameters into the second evaluation submodel, and predicting a corresponding reward value of the designated equipment after driving according to the control parameters at the set historical moment to be used as a second reward value to be optimized;
taking the reward value to be optimized to approach the actual reward value as an optimization target, training the decision model, specifically comprising:
and training a first evaluation submodel in the decision model by taking the first reward value to be optimized to approach the actual reward value as an optimization target, and training a second evaluation submodel in the decision model by taking the second reward value to be optimized to approach the actual reward value as an optimization target.
9. The method of claim 8, wherein the decision model comprises: a strategy sub-model and a target strategy sub-model;
taking the reward value to be optimized to approach the actual reward value as an optimization target, training the decision model, specifically comprising:
for each round of training, taking the first reward value to be optimized approaching the actual reward value as an optimization target, updating model parameters of the first evaluation submodel in the round of training based on a first parameter update step length, and taking the second reward value to be optimized approaching the actual reward value as an optimization target, updating model parameters of the second evaluation submodel in the round of training based on the first parameter update step length;
updating the model parameters of the strategy submodel in the round of training according to the model parameters of the first evaluation submodel in the round of training, the model parameters of the second evaluation submodel in the round of training and the second parameter updating step length;
and updating the model parameters of the target strategy submodel in the round of training according to the model parameters and the soft updating coefficient of the strategy submodel in the round of training until preset conditions are met so as to finish the training of the target strategy submodel in the decision model.
10. A method of controlling an unmanned aerial device, comprising:
acquiring state data corresponding to the unmanned equipment and each barrier at the current moment as current state data;
inputting the current state data into a pre-trained long-short term memory network to predict the state data of each obstacle after the current moment as predicted state data;
inputting the current state data and the predicted state data into a trained decision model, and determining a control parameter corresponding to the unmanned equipment at the current moment, wherein the decision model is obtained by training through the method of any one of claims 1 to 9;
and controlling the unmanned equipment according to the control parameters corresponding to the unmanned equipment at the current moment.
11. An apparatus for model training, comprising:
the acquisition module is used for acquiring historical state data corresponding to the designated equipment and each barrier at each historical moment;
the prediction module is used for inputting the historical state data into a pre-trained long-short term memory network according to a time sequence so as to predict the state data of each obstacle after the historical time is set as predicted state data;
the input module is used for inputting the historical state data and the prediction state data into a decision model to be trained so as to determine interesting data from the environmental data of the environment of the designated equipment at the set historical moment through the weight corresponding to the attention mechanism network;
the determining module is used for determining the control parameters corresponding to the designated equipment at the set historical moment according to the interested data;
and the training module is used for determining a reward value corresponding to the designated equipment after driving according to the control parameter at the set historical moment according to the interested data and the control parameter, and training the decision model according to the reward value.
12. A control apparatus for an unmanned aerial device, comprising:
the acquisition module is used for acquiring the state data of the unmanned equipment and the obstacles at the current moment as the current state data;
the prediction module is used for inputting the current state data into a pre-trained long-short term memory network so as to predict the state data of each obstacle after the current moment as predicted state data;
a determining module, configured to input the current state data and the predicted state data into a trained decision model, and determine a control parameter corresponding to the unmanned device at a current time, where the decision model is obtained by training according to the method of any one of claims 1 to 9;
and the control module is used for controlling the unmanned equipment according to the control parameters corresponding to the unmanned equipment at the current moment.
13. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of claims 1 to 9 or 10.
14. An unmanned aerial device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the method of any of claims 1 to 9 or 10.
CN202210161211.8A 2022-02-22 2022-02-22 Model training method, unmanned equipment control method and device Pending CN115047864A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210161211.8A CN115047864A (en) 2022-02-22 2022-02-22 Model training method, unmanned equipment control method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210161211.8A CN115047864A (en) 2022-02-22 2022-02-22 Model training method, unmanned equipment control method and device

Publications (1)

Publication Number Publication Date
CN115047864A true CN115047864A (en) 2022-09-13

Family

ID=83156854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210161211.8A Pending CN115047864A (en) 2022-02-22 2022-02-22 Model training method, unmanned equipment control method and device

Country Status (1)

Country Link
CN (1) CN115047864A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116205232A (en) * 2023-02-28 2023-06-02 之江实验室 Method, device, storage medium and equipment for determining target model
CN116205232B (en) * 2023-02-28 2023-09-01 之江实验室 Method, device, storage medium and equipment for determining target model

Similar Documents

Publication Publication Date Title
WO2021238303A1 (en) Motion planning method and apparatus
US20210370980A1 (en) Autonomous vehicle planning
CN111190427B (en) Method and device for planning track
CN112766468B (en) Trajectory prediction method and device, storage medium and electronic equipment
CN112306059B (en) Training method, control method and device for control model
CN111208838A (en) Control method and device of unmanned equipment
CN111912423B (en) Method and device for predicting obstacle trajectory and training model
CN111238523B (en) Method and device for predicting motion trail
CN113110526B (en) Model training method, unmanned equipment control method and device
CN111007858A (en) Training method of vehicle driving decision model, and driving decision determining method and device
CN113296541A (en) Future collision risk based unmanned equipment control method and device
CN112947495B (en) Model training method, unmanned equipment control method and device
CN115047864A (en) Model training method, unmanned equipment control method and device
CN114118276A (en) Network training method, control method and device
CN113264064B (en) Automatic driving method for intersection scene and related equipment
CN111123957B (en) Method and device for planning track
CN110895406B (en) Method and device for testing unmanned equipment based on interferent track planning
CN112949756A (en) Method and device for model training and trajectory planning
JP7347252B2 (en) Vehicle behavior evaluation device, vehicle behavior evaluation method, and vehicle behavior evaluation program
CN114296456A (en) Network training and unmanned equipment control method and device
CN112925331B (en) Unmanned equipment control method and device, storage medium and electronic equipment
CN114153207A (en) Control method and control device of unmanned equipment
EP4330107A1 (en) Motion planning
CN114280960A (en) Automatic driving simulation method and device, storage medium and electronic equipment
CN114545940A (en) Unmanned equipment control method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination