CN112925210B - Method and device for model training and unmanned equipment control - Google Patents

Method and device for model training and unmanned equipment control

Info

Publication number
CN112925210B
CN112925210B
Authority
CN
China
Prior art keywords
control strategy
reinforcement learning
model
simulation
optimized
Prior art date
Legal status
Active
Application number
CN202110508067.6A
Other languages
Chinese (zh)
Other versions
CN112925210A
Inventor
刘思威
白钰
贾庆山
任冬淳
樊明宇
夏华夏
毛一年
Current Assignee
Tsinghua University
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Tsinghua University
Beijing Sankuai Online Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Tsinghua University and Beijing Sankuai Online Technology Co Ltd
Priority to CN202110508067.6A
Publication of CN112925210A
Application granted
Publication of CN112925210B

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05B CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B13/04 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G05B13/042 Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning

Abstract

The embodiments of this specification arrange a safety verification module independently of the reinforcement learning network. When the control strategy output by the reinforcement learning network fails the safety verification, the control strategy output by a non-reinforcement-learning algorithm module is adopted as the control strategy to be optimized, and the reinforcement learning network is trained with the goal of maximizing the reward obtained by the control strategy to be optimized. As a result, when training the reinforcement learning model comprising the reinforcement learning network and the non-reinforcement-learning algorithm module, no high-precision virtual simulation system is needed, the cost of real-vehicle testing is greatly reduced, and the trained reinforcement learning model is applicable to a variety of complex scenarios.

Description

Method and device for model training and unmanned equipment control
Technical Field
The specification relates to the technical field of automatic driving, in particular to a method and a device for model training and unmanned equipment control.
Background
With the development of automatic driving technology, unmanned devices play an increasingly important role in many fields, and how to control them has become a topic of intense interest. Among the many approaches to the control problem of unmanned devices, reinforcement learning models are widely used because they depend less on samples and can yield an optimal control strategy.
The core idea of reinforcement learning is as follows: state information such as the state of the unmanned device (e.g., position and speed) and the states of obstacles is abstracted into an environment S. The environment S is input into the reinforcement learning model to obtain a control strategy a output by the model; under the control of strategy a, the unmanned device acts on the environment and changes S into S'. The environment S' is then input into the reinforcement learning model, which outputs the next control strategy a', and so on. Each time the environment changes, a reward R can be determined for the reinforcement learning model according to the changed environment, and reinforcement learning essentially solves for the control strategy that maximizes the expected value of the total reward obtained.
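As a concrete illustration, the minimal Python sketch below walks through this interaction loop; the Environment and Policy classes, the toy dynamics, and the reward for approaching a destination are illustrative assumptions rather than anything specified in this patent.

```python
# Minimal sketch of the reinforcement-learning interaction loop described above.
# All names (Environment, Policy, run_episode) are illustrative assumptions.

import random


class Environment:
    """Abstracts the states of the unmanned device and obstacles into a state S."""

    def __init__(self):
        self.position = 0.0
        self.velocity = 0.0

    def state(self):
        return (self.position, self.velocity)

    def step(self, action):
        """Apply control strategy a, change S into S', and return (S', reward R)."""
        self.velocity += action
        self.position += self.velocity
        reward = -abs(self.position - 10.0)  # e.g. reward for approaching a destination
        return self.state(), reward


class Policy:
    """Stand-in for the reinforcement learning model that outputs a control strategy."""

    def act(self, state):
        return random.uniform(-1.0, 1.0)


def run_episode(env, policy, steps=20):
    """Accumulate the total reward whose expectation reinforcement learning maximizes."""
    total_reward = 0.0
    state = env.state()
    for _ in range(steps):
        action = policy.act(state)          # control strategy a for state S
        state, reward = env.step(action)    # environment changes to S', reward R
        total_reward += reward
    return total_reward


if __name__ == "__main__":
    print(run_episode(Environment(), Policy()))
```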
In the reinforcement learning process, the reward obtained each time the environment changes generally depends on several criteria, such as whether the unmanned device reaches its destination as quickly as possible, whether the driving is smooth, whether the trajectory is smooth, and whether it is safe.
The safety criterion, however, is poorly served by reinforcement learning alone. The control strategy a output by the reinforcement learning model must first change the environment S; only then can a reward be obtained from the changed environment S' and used to evaluate whether strategy a was safe. In other words, the control strategy output by the model must act on the environment before any experience can be learned from it. For the critical safety criterion, this means that either a virtual simulation system of extremely high precision is needed to simulate how the output control strategy acts on and changes the environment, or a real unmanned device must be used for real-vehicle tests in a real environment to determine how the output control strategy changes the environment.
Because a virtual simulation system meeting the precision requirement is difficult to obtain, and the trial-and-error cost of real-vehicle testing is prohibitive, reinforcement learning models have so far been difficult to apply at scale in real unmanned-driving environments.
Disclosure of Invention
The embodiment of the specification provides a method and a device for model training and controlling unmanned equipment, so as to partially solve the problems in the prior art.
The embodiment of the specification adopts the following technical scheme:
the present specification provides a method for model training, including:
respectively inputting environmental information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
determining a security characterization value of the first control strategy;
selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value;
determining changed environment information according to the control strategy to be optimized;
determining the reward of the control strategy to be optimized according to the changed environmental information;
and adjusting parameters of a reinforcement learning network in the reinforcement learning model by taking the reward maximization as a training target, wherein the trained reinforcement learning model is used for outputting a control strategy for controlling the unmanned equipment.
Optionally, determining the safety characterizing value of the first control policy specifically includes:
obtaining a first simulation result output by the first simulation model according to the first control strategy and a preset first simulation model;
determining a safety characteristic value of a first control strategy according to the first simulation result;
determining the changed environment information according to the control strategy to be optimized, which specifically comprises:
obtaining a second simulation result output by the second simulation model according to the control strategy to be optimized and a preset second simulation model;
determining changed environment information according to the second simulation result;
wherein the simulation precision of the first simulation model is lower than that of the second simulation model.
Optionally, selecting a control policy to be optimized from the first control policy and the second control policy according to the safety characterization value specifically includes:
judging whether the safety characteristic value is lower than a preset threshold value or not;
if so, taking the second control strategy as a control strategy to be optimized;
and if not, taking the first control strategy as a control strategy to be optimized.
Optionally, the method further comprises:
and adjusting the trained reinforcement learning model according to the difference between the simulation precision of the second simulation model and the simulation precision of the first simulation model.
Optionally, the method further comprises:
determining and storing a compensation parameter according to the difference between the simulation precision of the second simulation model and the simulation precision of the first simulation model; and the compensation parameters are used for compensating the control strategy output by the trained reinforcement learning model.
The present specification provides a method of controlling an unmanned aerial device, the method comprising:
acquiring current environmental information of the unmanned equipment;
inputting the environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
determining a security characterization value of the first control strategy;
selecting a final control strategy from the first control strategy and the second control strategy according to the safety characterization value;
and controlling the unmanned equipment according to the final control strategy.
Optionally, selecting a final control policy from the first control policy and the second control policy according to the safety characterization value, specifically including:
judging whether the safety characteristic value is lower than a preset threshold value or not;
if so, taking the second control strategy as a final control strategy;
otherwise, the first control strategy is used as a final control strategy.
Optionally, controlling the unmanned device according to the final control strategy specifically includes:
compensating the final control strategy according to a prestored compensation parameter;
and controlling the unmanned equipment according to the compensated final control strategy.
The present specification provides an apparatus for model training, the apparatus comprising:
the input module is used for respectively inputting environmental information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
the safety evaluation module is used for determining a safety representation value of the first control strategy;
the selection module is used for selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value;
the feedback module is used for determining changed environment information according to the control strategy to be optimized;
the reward module is used for determining reward of the control strategy to be optimized according to the changed environment information;
and the training module is used for adjusting parameters of a reinforcement learning network in the reinforcement learning model by taking the reward maximization as a training target, wherein the trained reinforcement learning model is used for outputting a control strategy for controlling the unmanned equipment.
The present specification provides an apparatus for controlling an unmanned aerial device, the apparatus comprising:
the acquisition module is used for acquiring the current environmental information of the unmanned equipment;
the input module is used for inputting the environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
the safety evaluation module is used for determining a safety representation value of the first control strategy;
the selection module is used for selecting a final control strategy from the first control strategy and the second control strategy according to the safety characterization value;
and the control module is used for controlling the unmanned equipment according to the final control strategy.
The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described method of model training and method of controlling an unmanned aerial device.
The present specification provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the above-mentioned model training method and the above-mentioned method for controlling an unmanned aerial device.
The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:
in the embodiment of the description, a security verification module is independently arranged outside the reinforcement learning network, when the control strategy output by the reinforcement learning network cannot pass the security verification, the control strategy output by a non-reinforcement learning algorithm module is adopted as the control strategy to be optimized, and the reinforcement learning network is trained with the aim of maximizing the reward which can be obtained by the control strategy to be optimized. Therefore, when the reinforcement learning model comprising the reinforcement learning network and the non-reinforcement learning algorithm module is trained, a high-precision virtual simulation system is not needed, the cost of the real vehicle test is greatly reduced, and the trained reinforcement learning model can be suitable for various complex scenes.
Drawings
The accompanying drawings, which are included to provide a further understanding of the specification and constitute a part of it, illustrate embodiments of the specification and, together with the description, serve to explain the specification; they do not constitute an undue limitation of the specification. In the drawings:
FIG. 1 is a schematic diagram of a method for model training provided in an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a reinforcement learning model provided in an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a system for training the reinforcement learning model shown in FIG. 2 according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a method for controlling an unmanned aerial vehicle according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an apparatus for model training according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an apparatus for controlling an unmanned aerial vehicle according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of this specification.
Detailed Description
The embodiments of this specification verify, through a safety verification module arranged independently of the reinforcement learning network, whether a control strategy output by the reinforcement learning network is safe, thereby replacing the simulation result or real-vehicle test result that the safety criterion would otherwise require during training of the reinforcement learning network. If the control strategy is unsafe, the control strategy output by a non-reinforcement-learning algorithm module is used for training instead.
In order to make the objects, technical solutions and advantages of the present disclosure clearer, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the described embodiments are only some, not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this specification without any creative effort fall within the protection scope of this specification.
The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of a model training method provided in an embodiment of the present specification, including:
s100: and respectively inputting the environmental information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm.
In the embodiments of this specification, the device that trains the reinforcement learning model by the method shown in fig. 1 may be the unmanned device itself, or another device such as a server. The unmanned device in the embodiments of this specification may include unmanned vehicles and unmanned aerial vehicles, and may specifically be an unmanned delivery vehicle. The unmanned device can be used to perform logistics delivery tasks, including instant delivery tasks such as takeout delivery as well as non-instant delivery tasks.
The structure of the reinforcement learning model described in the embodiments of this specification is shown in fig. 2 and includes a reinforcement learning network and a non-reinforcement learning algorithm module. The reinforcement learning network may be implemented by a DDQN network, or by other reinforcement learning networks such as a DQN network or an A3C network; the non-reinforcement learning algorithm module may be implemented by a non-reinforcement-learning algorithm such as the EM-planner algorithm, or by a non-reinforcement-learning network trained in advance.
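The following Python sketch illustrates this two-branch structure only in outline; the linear stand-in for the reinforcement learning network and the brake-in-proportion-to-speed planner rule are placeholder assumptions, not the DDQN or EM-planner implementations themselves.

```python
# Illustrative sketch of the two-branch structure in Fig. 2: a reinforcement learning
# network and a non-reinforcement-learning planner side by side. The simple linear
# "network" and rule-based planner are placeholder assumptions, not the patented models.

import numpy as np


class ReinforcementLearningNetwork:
    """Placeholder for a DDQN/DQN/A3C-style network that outputs the first control strategy."""

    def __init__(self, state_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.weights = rng.normal(size=state_dim)

    def forward(self, state):
        return float(np.dot(self.weights, state))


class NonRLPlannerModule:
    """Placeholder for an EM-planner-style module that outputs the second control strategy."""

    def plan(self, state):
        # Conservative rule: brake in proportion to speed (state[1] assumed to be velocity).
        return -0.5 * state[1]


class ReinforcementLearningModel:
    def __init__(self, state_dim=4):
        self.rl_network = ReinforcementLearningNetwork(state_dim)
        self.planner = NonRLPlannerModule()

    def control_strategies(self, state):
        """Return (first_control_strategy, second_control_strategy) for one state."""
        return self.rl_network.forward(state), self.planner.plan(state)


if __name__ == "__main__":
    model = ReinforcementLearningModel()
    print(model.control_strategies(np.array([0.0, 2.0, 0.0, 0.0])))
```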
When the reinforcement learning model shown in fig. 2 is trained, the environment information is respectively input into the reinforcement learning network and the non-reinforcement learning algorithm module in the reinforcement learning model, and a first control strategy and a second control strategy which are respectively output by the reinforcement learning network and the non-reinforcement learning algorithm module are obtained.
The environment information described in the embodiments of this specification may include, but is not limited to, state information of the unmanned device itself (e.g., position, speed, acceleration) and state information of surrounding obstacles (e.g., position, speed, acceleration). During training of the reinforcement learning model, the environment information input as a sample may be pre-stored environment information, that is, pre-stored state information of the unmanned device and the obstacles at a historical time. It may also, of course, be the current environment information, that is, the state information of the unmanned device and the obstacles at the current time.
S102: a security characterizing value of the first control strategy is determined.
The reinforcement learning model shown in fig. 2 includes a first simulation model in addition to the reinforcement learning network and the non-reinforcement learning algorithm module.
After the first control strategy output by the reinforcement learning network is obtained, the first control strategy can be input into the first simulation model to obtain a first simulation result output by the first simulation model, and the safety characteristic value of the first control strategy is determined according to the first simulation result.
The first simulation model may be a pre-established simulation model of the unmanned device that carries the dynamics parameters of the unmanned device. After the first control strategy is input into the first simulation model, the first simulation model can output, according to the first control strategy, the result of simulating the unmanned device under the control of that strategy, that is, the first simulation result.
After the first simulation result is obtained, the safety characteristic value of the first control strategy can be determined according to a preset evaluation rule or a pre-trained evaluation model. A higher safety characterizing value indicates a safer first control strategy, and a lower safety characterizing value indicates a less safe first control strategy.
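A minimal sketch of this step is given below, assuming a coarse kinematic rollout as the first simulation model and a distance-based evaluation rule; both the 10-step rollout and the normalization to a value in [0, 1] are assumptions made purely for illustration.

```python
# Sketch of step S102: a (hypothetical) low-precision first simulation model rolls out
# the first control strategy, and a simple evaluation rule maps the simulation result to
# a safety characterization value in [0, 1]; higher means safer.


def first_simulation(position, velocity, acceleration_cmd, steps=10, dt=0.1):
    """Coarse kinematic rollout standing in for the first simulation model."""
    trajectory = []
    for _ in range(steps):
        velocity += acceleration_cmd * dt
        position += velocity * dt
        trajectory.append(position)
    return trajectory


def safety_characterization_value(trajectory, obstacle_position):
    """Evaluation rule (an assumption): closeness to the obstacle lowers the value."""
    min_gap = min(abs(obstacle_position - p) for p in trajectory)
    return min(1.0, min_gap / 10.0)  # 10 m treated as a fully safe gap


if __name__ == "__main__":
    traj = first_simulation(position=0.0, velocity=5.0, acceleration_cmd=1.0)
    print(safety_characterization_value(traj, obstacle_position=8.0))
```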
S104: and selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value.
Specifically, whether the safety characteristic value of the first control strategy is lower than a preset threshold value or not can be judged, if yes, the second control strategy is used as the control strategy to be optimized, and if not, the first control strategy is used as the control strategy to be optimized.
That is to say, when the reinforcement learning network cannot yet output a sufficiently safe first control strategy, the second control strategy output by the non-reinforcement learning algorithm module is taken as the control strategy to be optimized for subsequent training; when the reinforcement learning network can output a sufficiently safe first control strategy, the first control strategy is taken as the control strategy to be optimized for subsequent training.
S106: and determining changed environment information according to the control strategy to be optimized.
S108: and determining the reward of the control strategy to be optimized according to the changed environment information.
S110: and adjusting parameters of a reinforcement learning network in the reinforcement learning model by taking the reward maximization as a training target, wherein the trained reinforcement learning model is used for outputting a control strategy for controlling the unmanned equipment.
The steps S106 to S108 are the training process of reinforcement learning. Fig. 3 is a schematic structural diagram of a system for training the reinforcement learning model shown in fig. 2 according to an embodiment of the present disclosure, and as shown in fig. 3, in addition to the reinforcement learning model shown in fig. 2, a preset second simulation model is further included. After the control strategy to be optimized is obtained in step S104, the control strategy to be optimized may be input into the second simulation model to obtain a second simulation result output by the second simulation model, and the changed environmental information is determined according to the second simulation result.
The second simulation model may likewise be a pre-established simulation model of the unmanned device that carries the dynamics parameters of the unmanned device. After the control strategy to be optimized is input into the second simulation model, the second simulation model can output, according to that strategy, the result of simulating the unmanned device under its control, that is, the second simulation result. After the second simulation result is obtained, the changed environment information can be obtained from the second simulation result of the unmanned device together with the state information of the other obstacles around it.
After the changed environment information is obtained, on the one hand the reward of the control strategy to be optimized at this moment can be determined from the changed environment information; on the other hand, the changed environment information is fed into the reinforcement learning model again, that is, the process returns to step S100 for iterative training, and finally the parameters of the reinforcement learning network in the reinforcement learning model are adjusted with maximizing the expected reward accumulated over the iterations as the training target. Because the non-reinforcement learning algorithm module in the reinforcement learning model has been trained or tuned in advance, its parameters do not need to be adjusted during the training process.
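The self-contained Python sketch below ties steps S100 to S110 together. Every component in it (the linear network, the rule-based planner, the two toy simulation functions, the hill-climbing parameter update) is a simplified assumption; only the control flow, i.e. the safety gate, the fallback to the second control strategy, and the reward-driven update of the reinforcement learning network alone, follows the training procedure described here.

```python
# Self-contained sketch of one training iteration covering steps S100-S110.

import numpy as np

rng = np.random.default_rng(0)
rl_weights = rng.normal(size=2)             # parameters of the reinforcement learning network
SAFETY_THRESHOLD = 0.5


def rl_network(state):
    """First control strategy: a linear stand-in for the DDQN-style network."""
    return float(np.dot(rl_weights, state))


def non_rl_planner(state):
    """Second control strategy: a fixed rule standing in for the non-RL planner."""
    return -0.5 * state[1]


def safety_value(state, action):
    """Stands in for the first simulation model plus the evaluation rule."""
    return 1.0 / (1.0 + abs(action))


def second_simulation(state, action):
    """Stands in for the higher-precision second simulation model: returns the changed state."""
    return np.array([state[0] + state[1], state[1] + action])


def reward_of(state):
    """Reward determined from the changed environment information."""
    return -abs(state[0] - 10.0)


def train_step(state, lr=1e-3):
    """One iteration: select the control strategy to be optimized, then update rl_weights."""
    global rl_weights
    first, second = rl_network(state), non_rl_planner(state)
    action = first if safety_value(state, first) >= SAFETY_THRESHOLD else second
    next_state = second_simulation(state, action)
    reward = reward_of(next_state)
    # Crude hill-climbing update as a placeholder for DDQN-style training:
    candidate = rl_weights + rng.normal(size=rl_weights.shape) * lr
    candidate_action = float(np.dot(candidate, state))
    if reward_of(second_simulation(state, candidate_action)) > reward:
        rl_weights = candidate              # keep the new parameters only if the reward improves
    return next_state


if __name__ == "__main__":
    state = np.array([0.0, 1.0])
    for _ in range(20):
        state = train_step(state)
    print(state, rl_weights)
```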
The first simulation model and the second simulation model may be models of virtual unmanned devices or actual physical models (i.e., actual vehicle models).
The training process is essentially a process of adjusting the parameters of the reinforcement learning network, and the purpose of adjusting these parameters is to let the network learn, by maximizing the reward, which first control strategies are suitable to output under which environment information and which are not. Therefore, in the embodiments of this specification, when the safety characterization value of the first control strategy output by the reinforcement learning network is lower than the preset threshold, the reward of the control strategy to be optimized may be determined from both the changed environment information and a preset reward. The preset reward is lower than the reward determined from the changed environment alone, and the reward of the control strategy to be optimized determined from the changed environment information together with the preset reward is likewise lower than the reward determined from the changed environment alone. For example, the preset reward may be a negative value while the reward determined from the changed environment information is positive; when the safety characterization value of the first control strategy is lower than the preset threshold, the minimum of the reward determined from the changed environment information alone and the preset reward may be taken as the reward of the control strategy to be optimized. In this way, when training with reward maximization as the target, the reinforcement learning network learns that a first control strategy whose safety characterization value falls below the preset threshold is unsuitable, while the strategy that actually changed the environment was the second control strategy serving as the control strategy to be optimized. Safety of the applied control strategy is thus guaranteed, and the test cost is greatly reduced regardless of whether the changed environment information is obtained from a virtual simulation system or from a real-vehicle test.
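The reward rule from the example above can be written compactly as follows; the concrete penalty value and threshold are illustrative assumptions.

```python
# Sketch of the reward rule discussed above: when the first control strategy fails the
# safety check, the reward assigned to the training step is capped by a preset penalty.
# PRESET_PENALTY and SAFETY_THRESHOLD are illustrative values, not taken from the patent.

PRESET_PENALTY = -1.0
SAFETY_THRESHOLD = 0.5


def training_reward(environment_reward, safety_characterization_value):
    """Reward of the control strategy to be optimized for one training step."""
    if safety_characterization_value < SAFETY_THRESHOLD:
        # Unsafe first control strategy: take the minimum of the environment reward
        # and the preset (negative) reward so the RL network learns to avoid it.
        return min(environment_reward, PRESET_PENALTY)
    return environment_reward


if __name__ == "__main__":
    print(training_reward(environment_reward=0.8, safety_characterization_value=0.3))  # -1.0
    print(training_reward(environment_reward=0.8, safety_characterization_value=0.9))  # 0.8
```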
With this method, when training the reinforcement learning model comprising the reinforcement learning network and the non-reinforcement-learning algorithm module shown in fig. 2, a safety verification module is designed outside the reinforcement learning network. When the first control strategy output by the reinforcement learning network is unsafe, the second control strategy output by the non-reinforcement-learning algorithm module is used for training, which ensures that the control strategy to be optimized that changes the environment information is never dangerous. The training method therefore needs no high-precision virtual simulation system, and the trial-and-error cost of real-vehicle testing is greatly reduced. Moreover, because the non-reinforcement-learning algorithm module itself takes part in the reinforcement learning process, it is put to effective use and allows the method to be applied in complex scenarios such as urban road intersections.
Further, the evaluation rule or evaluation model used to determine the safety characterization value only needs to satisfy a specified safety standard. For example, a minimum safe distance between the unmanned device and obstacles (including a lateral minimum safe distance and a longitudinal minimum safe distance) may be predetermined. If the first simulation result of the first control strategy output by the reinforcement learning network brings the unmanned device closer to an obstacle than the minimum safe distance, the safety characterization value of the first control strategy is determined to be lower than the preset threshold; otherwise it is determined to be not lower than the preset threshold. In this way, the trial-and-error cost of real-vehicle testing can be effectively reduced.
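A sketch of such a minimum-safe-distance rule is shown below; the specific lateral and longitudinal gaps and the point-based trajectory representation are assumptions for illustration only.

```python
# Sketch of the distance-based evaluation rule described above: the first control
# strategy fails the safety check if its simulated trajectory brings the unmanned
# device within the predefined lateral or longitudinal minimum safe distance of an
# obstacle. The concrete distances are placeholder values.

MIN_LATERAL_GAP = 0.5       # metres, assumed
MIN_LONGITUDINAL_GAP = 2.0  # metres, assumed


def passes_safety_check(simulated_trajectory, obstacle_positions):
    """simulated_trajectory / obstacle_positions: lists of (x, y) points, x longitudinal."""
    for ex, ey in simulated_trajectory:
        for ox, oy in obstacle_positions:
            if abs(ex - ox) < MIN_LONGITUDINAL_GAP and abs(ey - oy) < MIN_LATERAL_GAP:
                return False
    return True


if __name__ == "__main__":
    traj = [(0.0, 0.0), (1.0, 0.1), (2.0, 0.2)]
    print(passes_safety_check(traj, obstacle_positions=[(2.5, 0.3)]))  # False: too close
```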
In addition, in the embodiments of the present specification, the simulation accuracy of the first simulation model may be lower than that of the second simulation model. The first simulation model is located in the safety verification module, and the safety verification module only needs to ensure the specified safety, so that a simulation model with relatively low simulation precision can be selected as the first simulation model in order to improve the training speed as much as possible. The second simulation model shown in fig. 3 is used to determine the changed environment information, and the changed environment information is used to determine the reward of the control strategy to be optimized, and accordingly train the reinforcement learning network, so that in order to improve the training accuracy as much as possible, the second simulation model may adopt a simulation model with relatively high simulation accuracy.
Furthermore, because both the first simulation model and the second simulation model are preset and their simulation precision is known, after the reinforcement learning model has been trained it can be adjusted according to the difference between the simulation precision of the second simulation model and that of the first simulation model. That is, after the reinforcement learning model has been trained with the low-precision first simulation model, the reinforcement learning model that would have been obtained had the high-precision second simulation model been used as the first simulation model can be determined directly from the precision difference between the two models and the trained model. Equivalently, if the trained reinforcement learning model is to be "migrated" to the higher-precision second simulation model, it can be adjusted directly according to the precision difference without retraining. Depending on usage requirements, the trained reinforcement learning model can likewise be migrated directly to an actual unmanned device according to the difference between the simulation precision of the first simulation model and that of the actual device. After the trained model has been migrated to the higher-precision second simulation model or to the actual unmanned device in this way, the migrated model may be trained further, but only a small amount of training is needed rather than extensive retraining.
Of course, in addition to improving the accuracy of the trained reinforcement learning model by "migration", a compensation parameter may be determined and stored according to the difference between the simulation precision of the second simulation model and that of the first simulation model; this parameter is used to compensate the control strategy output by the trained reinforcement learning model.
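The patent does not fix the form of this compensation, so the sketch below assumes the simplest case, an additive offset calibrated from the average gap between the two simulation models' responses and then applied to the model's output control strategy.

```python
# One possible (assumed) form of the compensation described above: calibrate a simple
# additive offset from the gap between the first and second simulation models, store it,
# and apply it to control strategies output by the trained reinforcement learning model.


def calibrate_compensation(first_sim, second_sim, probe_commands):
    """Average gap between the two simulation models' responses to the same commands."""
    gaps = [second_sim(c) - first_sim(c) for c in probe_commands]
    return sum(gaps) / len(gaps)


def compensate(control_strategy, compensation_parameter):
    """Apply the stored compensation parameter to a control strategy output by the model."""
    return control_strategy + compensation_parameter


if __name__ == "__main__":
    # Hypothetical responses: each model maps an acceleration command to the acceleration
    # it predicts the vehicle will actually achieve.
    def low_precision_model(cmd):
        return 0.9 * cmd

    def high_precision_model(cmd):
        return 1.0 * cmd + 0.05

    comp = calibrate_compensation(low_precision_model, high_precision_model, [0.5, 1.0, 1.5])
    print(compensate(1.2, comp))  # control strategy shifted by the stored offset
```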
Based on the training method of the reinforcement learning model, an embodiment of the present specification further provides a method for controlling an unmanned aerial vehicle, as shown in fig. 4.
Fig. 4 is a schematic diagram of a method for controlling an unmanned aerial vehicle according to an embodiment of the present disclosure, where the method includes the following steps:
s400: and acquiring the current environment information of the unmanned equipment.
In the embodiments of this specification, when the unmanned device is controlled using the reinforcement learning model trained by the method shown in fig. 1, the current environment information of the unmanned device may be obtained first. The environment information includes, but is not limited to, state information of the unmanned device itself and state information of the obstacles around it, and the state information includes, but is not limited to, position, speed, and acceleration.
The environment information can be acquired by sensors mounted on the unmanned device, such as one or more of an image acquisition device, a lidar, and an inertial measurement unit.
The execution subject for controlling the unmanned device by the method shown in fig. 4 may be the unmanned device itself, or may be another device such as a server.
S402: and inputting the environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module.
In the embodiment of the present disclosure, the trained reinforcement learning model may only include the reinforcement learning network, the non-reinforcement learning algorithm module, and the security verification module as shown in fig. 2. After the environmental information of the unmanned aerial vehicle is acquired through step S400, the environmental information may be input to a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model, so as to obtain a first control strategy and a second control strategy that are respectively output by the reinforcement learning network and the non-reinforcement learning algorithm module.
S404: a security characterizing value of the first control strategy is determined.
S406: and selecting a final control strategy from the first control strategy and the second control strategy according to the safety characterization value.
Step S404 is the same as step S102, and step S406 is the same as step S104. Specifically, in step S404 a first simulation result of the first control strategy is determined by the first simulation model in the safety verification module, and the safety characterization value of the first control strategy is determined according to that result. In step S406, it can be judged whether the safety characterization value is lower than the preset threshold; if so, the second control strategy is taken as the final control strategy, and otherwise the first control strategy is taken as the final control strategy.
S408: and controlling the unmanned equipment according to the final control strategy.
And after the final control strategy is selected, the unmanned equipment can be controlled according to the final control strategy.
In addition, when compensation parameters were determined and stored during training of the reinforcement learning model based on the simulation precision of the first and second simulation models, the final control strategy selected in step S406 may be compensated accordingly, and the unmanned device may then be controlled in step S408 according to the compensated final control strategy.
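An end-to-end sketch of steps S400 to S408 is given below; the component functions, the threshold, and the stored compensation value are all placeholder assumptions, and only the control flow mirrors the method just described.

```python
# End-to-end sketch of steps S400-S408 at inference time. All component functions are
# placeholders (assumptions); only the control flow follows the method described above.

SAFETY_THRESHOLD = 0.5
COMPENSATION_PARAMETER = 0.05   # assumed to have been determined and stored during training


def control_once(env_info, rl_network, non_rl_planner, safety_value, send_to_vehicle):
    """One pass through S400-S408 for the current environment information."""
    first = rl_network(env_info)                   # first control strategy
    second = non_rl_planner(env_info)              # second control strategy
    final = first if safety_value(env_info, first) >= SAFETY_THRESHOLD else second
    final = final + COMPENSATION_PARAMETER         # compensate with the stored parameter
    send_to_vehicle(final)                         # control the unmanned device
    return final


if __name__ == "__main__":
    env = {"speed": 4.0}                           # toy stand-in for the environment information
    print(control_once(
        env,
        rl_network=lambda e: 0.3 * e["speed"],
        non_rl_planner=lambda e: -0.1 * e["speed"],
        safety_value=lambda e, a: 1.0 / (1.0 + abs(a)),
        send_to_vehicle=lambda a: None,
    ))
```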
By the above unmanned equipment control method shown in fig. 4, the unmanned equipment can be controlled by the reinforcement learning network in various complex scenes, so that the application of the reinforcement learning network is not limited to simple scenes such as highways.
Based on the same idea, the present specification further provides a corresponding apparatus, a storage medium, and an electronic device.
Fig. 5 is a schematic structural diagram of an apparatus for model training provided in an embodiment of the present specification, where the apparatus includes:
an input module 501, configured to input environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model respectively, so as to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
a security evaluation module 502 for determining a security characterizing value of the first control strategy;
a selecting module 503, configured to select a control policy to be optimized from the first control policy and the second control policy according to the security characterization value;
a feedback module 504, configured to determine changed environment information according to the control policy to be optimized;
a reward module 505, configured to determine a reward of the control policy to be optimized according to the changed environment information;
a training module 506, configured to adjust parameters of a reinforcement learning network in the reinforcement learning model with the reward maximization as a training target, where the trained reinforcement learning model is used to output a control strategy for controlling the unmanned device.
Optionally, the safety evaluation module 502 is specifically configured to obtain a first simulation result output by a first simulation model according to the first control policy and a preset first simulation model; determining a safety characteristic value of a first control strategy according to the first simulation result;
the feedback module 504 is specifically configured to obtain a second simulation result output by the second simulation model according to the control strategy to be optimized and a preset second simulation model; determining changed environment information according to the second simulation result;
wherein the simulation precision of the first simulation model is lower than that of the second simulation model.
Optionally, the selecting module 503 is specifically configured to determine whether the safety characterizing value is lower than a preset threshold; if so, taking the second control strategy as a control strategy to be optimized; and if not, taking the first control strategy as a control strategy to be optimized.
Optionally, the apparatus further comprises:
and an adjusting module 507, configured to adjust the trained reinforcement learning model according to a difference between simulation accuracies of the second simulation model and the first simulation model.
Optionally, the apparatus further comprises:
an adjusting module 507, configured to determine and store a compensation parameter according to a difference between simulation accuracies of the second simulation model and the first simulation model; and the compensation parameters are used for compensating the control strategy output by the trained reinforcement learning model.
Fig. 6 is a schematic structural diagram of an apparatus for controlling an unmanned aerial vehicle according to an embodiment of the present disclosure, where the apparatus includes:
an obtaining module 601, configured to obtain current environment information of the unmanned device;
an input module 602, configured to input the environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model, so as to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
a security evaluation module 603 configured to determine a security characterization value of the first control policy;
a selecting module 604, configured to select a final control policy from the first control policy and the second control policy according to the security characterization value;
and a control module 605, configured to control the unmanned device according to the final control policy.
Optionally, the selecting module 604 is specifically configured to determine whether the safety characterizing value is lower than a preset threshold; if so, taking the second control strategy as a final control strategy; otherwise, the first control strategy is used as a final control strategy.
Optionally, the control module 605 is specifically configured to compensate the final control strategy according to a pre-stored compensation parameter; and controlling the unmanned equipment according to the compensated final control strategy.
The present specification also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, is operable to perform the method of model training and the method of controlling an unmanned aerial device provided above.
Based on the model training method and the method for controlling the unmanned device provided above, the embodiments of this specification further provide a schematic structural diagram of the electronic device shown in fig. 7. As shown in fig. 7, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, a memory, and a non-volatile storage, and may of course also include hardware required for other services. The processor reads the corresponding computer program from the non-volatile storage into the memory and then runs it to implement the model training method and the method for controlling the unmanned device described above.
Of course, besides the software implementation, the present specification does not exclude other implementations, such as logic devices or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may be hardware or logic devices.
In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). With the development of technology, however, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be realized with hardware entity modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user's programming of the device. A designer "integrates" a digital system onto a single PLD by programming it, without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually making integrated circuit chips, this programming is now mostly implemented with "logic compiler" software, which is similar to the software compilers used in program development; the original code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can readily be obtained merely by slightly logically programming the method flow into an integrated circuit using one of the above hardware description languages.
The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller; examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of the memory. Those skilled in the art also know that, besides implementing the controller purely as computer-readable program code, the method steps can be logically programmed so that the controller achieves the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller may therefore be regarded as a hardware component, and the means included in it for realizing various functions may also be regarded as structures within the hardware component, or even as both software modules for implementing the method and structures within the hardware component.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The description has been presented with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the description. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims (7)

1. A method of model training, comprising:
respectively inputting environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model, to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
determining a safety characterization value of the first control strategy;
selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value;
determining changed environment information according to the control strategy to be optimized;
determining a reward of the control strategy to be optimized according to the changed environment information;
adjusting parameters of the reinforcement learning network in the reinforcement learning model with maximization of the reward as a training objective, wherein the trained reinforcement learning model is used for outputting a control strategy for controlling unmanned equipment;
wherein selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value specifically comprises:
determining whether the safety characterization value is lower than a preset threshold;
if so, taking the second control strategy as the control strategy to be optimized;
and if not, taking the first control strategy as the control strategy to be optimized.
2. The method according to claim 1, wherein determining the safety characterization value of the first control strategy specifically comprises:
obtaining, according to the first control strategy and a preset first simulation model, a first simulation result output by the first simulation model;
determining the safety characterization value of the first control strategy according to the first simulation result;
wherein determining the changed environment information according to the control strategy to be optimized specifically comprises:
obtaining, according to the control strategy to be optimized and a preset second simulation model, a second simulation result output by the second simulation model;
determining the changed environment information according to the second simulation result;
wherein the simulation precision of the first simulation model is lower than that of the second simulation model.
3. The method of claim 2, wherein the method further comprises:
adjusting the trained reinforcement learning model according to the difference between the simulation precision of the second simulation model and that of the first simulation model.
4. The method of claim 2, wherein the method further comprises:
determining and storing a compensation parameter according to the difference between the simulation precision of the second simulation model and that of the first simulation model, wherein the compensation parameter is used to compensate the control strategy output by the trained reinforcement learning model.
5. An apparatus for model training, comprising:
the input module is used for respectively inputting environment information into a reinforcement learning network and a non-reinforcement learning algorithm module in a reinforcement learning model, to obtain a first control strategy output by the reinforcement learning network and a second control strategy output by the non-reinforcement learning algorithm module;
the safety evaluation module is used for determining a safety characterization value of the first control strategy;
the selection module is used for selecting a control strategy to be optimized from the first control strategy and the second control strategy according to the safety characterization value;
the feedback module is used for determining changed environment information according to the control strategy to be optimized;
the reward module is used for determining a reward of the control strategy to be optimized according to the changed environment information;
the training module is used for adjusting parameters of the reinforcement learning network in the reinforcement learning model with maximization of the reward as a training objective, wherein the trained reinforcement learning model is used for outputting a control strategy for controlling unmanned equipment;
the selection module is specifically configured to determine whether the safety characterization value is lower than a preset threshold; if so, take the second control strategy as the control strategy to be optimized; and if not, take the first control strategy as the control strategy to be optimized.
6. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 4.
7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 4 when executing the program.
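
For illustration only, the following is a minimal Python sketch of the training loop recited in claims 1, 2 and 5. It is not the patent's reference implementation: the reinforcement learning network, the non-reinforcement-learning fallback controller, the two simulation models of different precision, the safety characterization value and the reward function below are all assumed stand-ins, chosen solely to make the claimed control flow concrete (safety check against a low-precision simulator, fallback to the second control strategy when the safety characterization value falls below the preset threshold, environment transition through a higher-precision simulator, and a reward-driven adjustment of the reinforcement learning network's parameters).

# Minimal sketch, assuming a 1-D state and stand-in models for illustration.
import random

SAFETY_THRESHOLD = 0.5  # assumed value for the "preset threshold" of claim 1


class RLPolicy:
    # Stand-in for the reinforcement learning network producing the first control strategy.
    def __init__(self):
        self.weight = 0.0  # single tunable parameter, standing in for network parameters

    def act(self, env_info):
        return self.weight * env_info

    def update(self, env_info, reward, lr=0.01):
        # Crude reward-weighted adjustment standing in for "maximize the reward";
        # a real system would use a proper RL algorithm (e.g. a policy-gradient method).
        self.weight += lr * reward * env_info


def rule_based_policy(env_info):
    # Stand-in for the non-reinforcement-learning algorithm module (second control strategy).
    return max(-1.0, min(1.0, -0.5 * env_info))


def low_precision_simulator(env_info, action):
    # First simulation model (claim 2): cheap, coarse rollout used only for the safety check.
    return env_info + action


def high_precision_simulator(env_info, action):
    # Second simulation model (claim 2): higher-fidelity transition producing the changed environment.
    return env_info + action - 0.1 * action ** 3 + random.gauss(0.0, 0.01)


def safety_value(first_sim_result):
    # Safety characterization value of the first control strategy: here, closeness of the
    # predicted state to an assumed safe region around zero (purely illustrative).
    return 1.0 / (1.0 + abs(first_sim_result))


def reward_fn(changed_env_info):
    # Reward of the control strategy to be optimized, computed from the changed environment info.
    return -abs(changed_env_info)


def training_step(policy, env_info):
    first = policy.act(env_info)            # first control strategy (RL network)
    second = rule_based_policy(env_info)    # second control strategy (non-RL module)

    first_sim = low_precision_simulator(env_info, first)
    safe = safety_value(first_sim)

    # Claim 1: fall back to the second strategy when the safety value is below the threshold.
    to_optimize = second if safe < SAFETY_THRESHOLD else first

    changed = high_precision_simulator(env_info, to_optimize)  # changed environment info
    reward = reward_fn(changed)

    policy.update(env_info, reward)  # adjust only the RL network's parameters
    return changed, reward


if __name__ == "__main__":
    policy = RLPolicy()
    env_info, reward = 1.0, 0.0
    for _ in range(100):
        env_info, reward = training_step(policy, env_info)
    print("final weight:", round(policy.weight, 3), "last reward:", round(reward, 3))

Under the same assumptions, the compensation parameter of claim 4 could be approximated as a running statistic of the gap between the two simulators' predictions for the same state-action pairs, stored alongside the trained model and applied to its output control strategy at deployment time; claim 3's adjustment of the trained model based on the precision difference admits a similar treatment.
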
CN202110508067.6A 2021-05-11 2021-05-11 Method and device for model training and unmanned equipment control Active CN112925210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110508067.6A CN112925210B (en) 2021-05-11 2021-05-11 Method and device for model training and unmanned equipment control

Publications (2)

Publication Number Publication Date
CN112925210A CN112925210A (en) 2021-06-08
CN112925210B true CN112925210B (en) 2021-09-07

Family

ID=76174761

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110508067.6A Active CN112925210B (en) 2021-05-11 2021-05-11 Method and device for model training and unmanned equipment control

Country Status (1)

Country Link
CN (1) CN112925210B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298445B (en) * 2021-07-22 2022-07-15 北京三快在线科技有限公司 Method and device for model training and unmanned equipment scheduling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10002530B1 (en) * 2017-03-08 2018-06-19 Fujitsu Limited Traffic signal control using multiple Q-learning categories

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104049538A (en) * 2009-07-10 2014-09-17 爱默生过程管理电力和水解决方案公司 Method and apparatus to compensate first principle-based simulation models
CN105512388A (en) * 2015-12-04 2016-04-20 河北省电力建设调整试验所 Sub-critical thermal power generating unit enhancing stimulation and simulation modeling method based on LABVIEW
CN110879865A (en) * 2019-10-31 2020-03-13 支付宝(杭州)信息技术有限公司 Recommendation method and device for nuclear products
CN110969848A (en) * 2019-11-26 2020-04-07 武汉理工大学 Automatic driving overtaking decision method based on reinforcement learning under opposite double lanes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Vehicle Driving Behavior Decision-Making Method Based on DQN; Luo Peng et al.; Journal of Transport Information and Safety; 2020-05-31; Vol. 38, No. 227; pp. 67-77 *

Also Published As

Publication number Publication date
CN112925210A (en) 2021-06-08

Similar Documents

Publication Publication Date Title
CN112766468B (en) Trajectory prediction method and device, storage medium and electronic equipment
CN110929431B (en) Training method and device for vehicle driving decision model
CN111076739B (en) Path planning method and device
CN111208838A (en) Control method and device of unmanned equipment
CN110488821B (en) Method and device for determining unmanned vehicle motion strategy
CN111522245B (en) Method and device for controlling unmanned equipment
CN111062372B (en) Method and device for predicting obstacle track
CN112346467B (en) Control method and device of unmanned equipment
CN111532285B (en) Vehicle control method and device
CN112925210B (en) Method and device for model training and unmanned equipment control
CN114547972A (en) Dynamic model construction method and device, storage medium and electronic equipment
CN113342005B (en) Transverse control method and device for unmanned equipment
CN112731957B (en) Unmanned aerial vehicle control method and device, computer readable storage medium and unmanned aerial vehicle
CN111123957B (en) Method and device for planning track
CN112947495B (en) Model training method, unmanned equipment control method and device
CN110895406B (en) Method and device for testing unmanned equipment based on interferent track planning
CN112925331B (en) Unmanned equipment control method and device, storage medium and electronic equipment
CN114153207B (en) Control method and control device of unmanned equipment
CN114280960A (en) Automatic driving simulation method and device, storage medium and electronic equipment
CN113340311B (en) Path planning method and device for unmanned equipment
CN114296456A (en) Network training and unmanned equipment control method and device
CN111046981B (en) Training method and device for unmanned vehicle control model
CN114120273A (en) Model training method and device
CN115047864A (en) Model training method, unmanned equipment control method and device
CN114721290A (en) Simulation test scene generation method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant