CN113879323B - Reliable learning type automatic driving decision-making method, system, storage medium and equipment - Google Patents

Reliable learning type automatic driving decision-making method, system, storage medium and equipment Download PDF

Info

Publication number
CN113879323B
CN113879323B (application CN202111246972.5A)
Authority
CN
China
Prior art keywords
decision
learning
interpretable
value
automatic driving
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111246972.5A
Other languages
Chinese (zh)
Other versions
CN113879323A (en)
Inventor
杨殿阁
曹重
周伟韬
邓楠山
焦新宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202111246972.5A priority Critical patent/CN113879323B/en
Publication of CN113879323A publication Critical patent/CN113879323A/en
Application granted granted Critical
Publication of CN113879323B publication Critical patent/CN113879323B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • B — PERFORMING OPERATIONS; TRANSPORTING
    • B60 — VEHICLES IN GENERAL
    • B60W — CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 — Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W60/00 — Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001 — Planning or execution of driving tasks
    • B60W2050/0001 — Details of the control system
    • B60W2050/0019 — Control system elements or transfer functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The invention relates to a reliable learning-based automated driving decision method, system, storage medium and device. The method comprises the following steps: constructing an interpretable decision based on a preset decision problem, the interpretable decision guiding the training of the learning-based decision; training the learning-based decision on the decision problem to obtain a learning-based decision with a high-value decision cost function; and selecting the higher-valued of the learning-based decision and the interpretable decision as the final reliable decision action. The invention guarantees the reliability of the learning-based decision of an autonomous vehicle, thereby ensuring the high reliability of the vehicle, and can be widely applied in the technical field of automated driving.

Description

Reliable learning type automatic driving decision-making method, system, storage medium and equipment
Technical Field
The invention relates to the technical field of autonomous vehicle decision-making, in particular to a learning-based automated driving decision method, system, storage medium and device that are based on reinforcement learning and provide reliable driving performance.
Background
Autonomous decision-making is an important component of an automated driving system, and learning-based decision methods are expected to acquire, through autonomous learning, driving capabilities exceeding those of humans. The problem is that learning-based methods are black boxes whose decision performance is difficult to predict, which conflicts with the high-reliability requirements of autonomous vehicles. Constructing a reliable learning-based automated driving decision method is therefore important for raising the intelligence level of autonomous vehicles.
At present, methods for guaranteeing the reliability of learning-based automated driving fall into three categories: adding safety constraints, guiding decision training, and exploring dangerous scenarios. The main idea of adding safety constraints is to analyze the safety of the trajectory output by the learning-based decision and adjust it in time when a potential danger is found; the problem is that in complex scenarios it remains very difficult to guarantee absolute safety with manually designed rules. Guided decision training and dangerous-scenario exploration both adjust the training direction or add specific data during training to improve the safety of the learning-based decision. They differ in that guided training keeps the learning-based decision from exploring dangerous scenarios and has it learn in safe scenarios as much as possible, so that the resulting driving strategy stays in safer situations, whereas dangerous-scenario exploration has the learning-based decision learn repeatedly in dangerous scenarios to acquire the ability to handle them. However, both methods simply rely on the learning ability of the learning-based decision itself and do not consider the reliability of the final output, so a reliable learning-based automated driving decision method remains difficult to achieve.
Disclosure of Invention
In view of the above problems, an object of the present invention is to provide a reliable learning-based automated driving decision method, system, storage medium and device that guarantee the reliability of the learning-based decision of an autonomous vehicle, thereby ensuring the high reliability of the vehicle.
To achieve the above object, in one aspect the invention adopts the following technical scheme: a reliable learning-based automated driving decision method, comprising: constructing an interpretable decision based on a preset decision problem, the interpretable decision guiding the training of the learning-based decision; training the learning-based decision on the decision problem to obtain a learning-based decision with a high-value decision cost function; and selecting the higher-valued of the learning-based decision and the interpretable decision as the final reliable decision action.
Further, the decision problem consists of three elements: the environmental observation state, the automated driving action, and the instant reward.
Further, the training of a learning-based decision by the decision problem comprises:
setting a decision value function;
estimating a cost function of the interpretable decision;
and learning to obtain a high-value decision value function according to the interpretable decision value function and the set decision value function.
Further, the estimating of the cost function of the interpretable decision comprises: constructing a data set and obtaining the cost function of the interpretable decision from the data set by recursion.
Further, the data set is composed of data elements; each data element is the next-time-step state obtained by applying the interpretable driving strategy in a given state;
alternatively, the data set is obtained by driving the vehicle directly with the interpretable decision and collecting driving data during driving.
Further, the learning of a high-value decision cost function from the cost function of the interpretable decision and the set decision cost function comprises:
when the autonomous vehicle encounters a state it has not encountered before, driving with the interpretable decision and initializing the cost function of the learning-based decision from the environment's feedback;
when the autonomous vehicle encounters a state it has encountered before, generating a new action, and after the next-time-step state is obtained, updating the cost function of the learning-based decision according to the new action.
Further, the new action a is:

$$a = \arg\max_{a}\left[\,Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}} + \delta(s,a,\pi_r)\right]$$

wherein N(s) represents the number of times the current state s has been encountered, N(s,a) represents the number of times action a has been taken in the current state, Q(s,a) represents the decision cost function of taking action a in state s, δ(s,a,π_r) is the interpretable decision-inducing value, π_r represents the interpretable decision, and c is a manually adjusted constant.
In another aspect the invention adopts the following technical scheme: a reliable learning-based automated driving decision system, comprising an interpretable decision construction module, a learning-based decision training module and an output module. The interpretable decision construction module constructs an interpretable decision based on a preset decision problem, and the interpretable decision guides the training of the learning-based decision; the learning-based decision training module trains the learning-based decision on the decision problem to obtain a learning-based decision with a high-value decision cost function; and the output module selects the higher-valued of the learning-based decision and the interpretable decision as the final reliable decision action.
On the other hand, the technical scheme adopted by the invention is as follows: a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the above methods.
On the other hand, the technical scheme adopted by the invention is as follows: a computing device, comprising: one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods described above.
Due to the adoption of the technical scheme, the invention has the following advantages:
1. The invention adjusts the evaluation process of the reinforcement-learning decision's cost function so that the decision cost function of the finally generated strategy is no lower than that of a given interpretable driving strategy, thereby guaranteeing the reliability of the learning-based decision of the autonomous vehicle.
2. The method fully exploits the decision-making capability of reinforcement learning toward a definite objective in a highly uncertain environment, while using the interpretable strategy to bound performance from below, thereby ensuring the high reliability of the autonomous vehicle; it is an important technology for achieving reliable learning-based decision-making for autonomous vehicles.
Drawings
FIG. 1 is a flow chart illustrating a learning-based automatic driving decision method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of key elements of a learning-based decision-making problem according to an embodiment of the present invention;
FIG. 3 is a block diagram of a reliable learning-based decision making architecture in accordance with an embodiment of the present invention;
FIG. 4 is a graph of state-action sampling versus decision cost function in one embodiment of the present invention;
FIG. 5 is a graph of induced values versus learned decision confidence in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a computing device in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It should be apparent that the described embodiments are only some of the embodiments of the present invention, and not all of them. All other embodiments, which can be derived by a person skilled in the art from the description of the embodiments of the invention given above, are within the scope of protection of the invention.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
The invention provides a reliable learning-based automated driving decision method, system, storage medium and device for autonomous vehicles, which solve a constructed decision problem to generate an optimal or near-optimal strategy. The invention is not tied to a specific decision problem, but requires that the key elements of the decision problem be constructed.
In an embodiment of the present invention, a reliable learning type automatic driving decision method is provided, and this embodiment is exemplified by applying the method to a terminal, it is to be understood that the method may also be applied to a server, and may also be applied to a system including a terminal and a server, and is implemented through interaction between the terminal and the server.
In this embodiment, as shown in fig. 1, the method includes the following steps:
1) Constructing an interpretable decision based on a preset decision problem, the interpretable decision guiding the training of the learning-based decision;
2) Training the learning-based decision on the decision problem to obtain a learning-based strategy with a high-value decision cost function;
3) Selecting the higher-valued of the learning-based decision and the interpretable decision as the final reliable decision action.
In step 1), the decision problem consists of three elements: the environmental observation state, the automated driving action, and the instant reward.
In this embodiment, as shown in fig. 2, the automated driving decision problem is composed of the following three elements: the environmental observation state s, the automated driving action a and the instant reward r. Wherein:
the environmental observation state s refers to the states of the surrounding dynamic and static elements obtained by sensors and other means, such as road information and information about environmental vehicles, pedestrians, cyclists and other dynamic and static elements;
the automated driving action a refers to an instruction the control module can accept, which may be a driving trajectory containing speed information, or a vehicle control command such as a steering wheel angle or throttle/brake input;
the instant reward r refers to a quantified reward or penalty for the current driving state according to traffic rules, occupant requirements and the like, covering aspects such as safety, smoothness and traffic efficiency. The instant reward function evaluates only the current environment and the current action, and does not need to reward or penalize future or past states and behaviors.
The construction of the above three elements is a precondition for using the decision method of the present invention.
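As an illustration of how these three elements might be encoded in software, the following minimal Python sketch defines a state s, an action a, and an instant reward r that evaluates only the current environment and current action; all field names, thresholds and weights are illustrative assumptions, not part of the patent:

```python
from dataclasses import dataclass

# Hypothetical containers for the three elements of the decision problem:
# environmental observation state s, driving action a, instant reward r.
@dataclass
class Observation:
    ego_speed: float     # m/s
    lane_offset: float   # m, lateral offset from lane center
    lead_gap: float      # m, gap to the lead vehicle

@dataclass
class Action:
    steer: float         # steering wheel angle, rad
    accel: float         # longitudinal acceleration command, m/s^2

def instant_reward(obs: Observation, act: Action) -> float:
    """Evaluate only the current state and action: safety, smoothness, efficiency."""
    safety = -1.0 if obs.lead_gap < 5.0 else 0.0   # penalize dangerously small gaps
    smooth = -0.1 * abs(act.accel)                 # penalize harsh acceleration
    efficiency = 0.05 * obs.ego_speed              # reward making progress
    return safety + smooth + efficiency

obs = Observation(ego_speed=10.0, lane_offset=0.1, lead_gap=30.0)
act = Action(steer=0.0, accel=1.0)
print(round(instant_reward(obs, act), 3))  # 0.4
```

Note that, as the text requires, the reward looks only at the current (obs, act) pair; no future or past state enters the computation.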
In step 1), the interpretable decision is constructed as follows. An interpretable driving decision is a decision method designed from rules or models, whose output actions have explicit logic; mainstream automated driving decision systems are based on such interpretable methods. In the present invention, the interpretable decision method is used to guarantee the performance bound of the learning-based decision, i.e. the final learning-based decision performance is required to be no lower than that of the interpretable decision. The invention places no requirement on the source or form of the interpretable decision method, and the interpretable decision is constructed in the following form:
$$\pi_r : \mathcal{S} \to \mathcal{A}, \qquad a_r = \pi_r(s)$$

wherein $\mathcal{S}$ represents the space formed by all possible environmental observation states; $\mathcal{A}$ represents the space formed by all possible decision actions; $\pi_r$ represents the interpretable decision, i.e. the mapping from the state space to the action space; and $a_r$ represents the action output by the interpretable decision.
This construction only constrains the input and output forms of the interpretable decision, requiring them to be consistent with the decision problem of the learning-based decision; in a specific decision process, only part of the information may actually be used. The interpretability and adjustability of the interpretable decision are the basis of the reliability of the present invention.
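For concreteness, a rule-based interpretable decision π_r of the constrained form above (state in, action out) might look like the following Python sketch; the specific rules and thresholds are hypothetical and serve only to show that every output action carries explicit logic:

```python
# A minimal rule-based (interpretable) driving policy pi_r: state -> action.
# The rules and thresholds below are illustrative assumptions, not the patent's policy.
def pi_r(state: dict) -> str:
    gap = state["lead_gap"]      # m, gap to the lead vehicle
    speed = state["ego_speed"]   # m/s
    if gap < 10.0:
        return "brake"           # explicit safety rule: too close, slow down
    if speed < 15.0 and gap > 30.0:
        return "accelerate"      # free road and below target speed
    return "keep"                # otherwise hold current speed

print(pi_r({"lead_gap": 8.0, "ego_speed": 20.0}))   # brake
print(pi_r({"lead_gap": 50.0, "ego_speed": 10.0}))  # accelerate
```

Because each branch is an explicit rule, the reason for any output action can be stated directly, which is exactly the property the invention uses to bound the learning-based decision.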
In the step 2), as shown in fig. 1 and 3, the training of the learning type decision by the decision problem includes the following steps:
2.1 Set a decision cost function;
the decision value refers to the evaluation of the current state and the performance of different decisions in a period of time in the future, and the decision value function is a function for establishing the relationship between the decision and the decision value. The trustworthy learning decision training process will be based on the evaluation of cost functions for different decisions, thereby obtaining a strategy with the highest possible value, while avoiding that the value of the generated strategy is lower than the interpretable decision.
The decision cost function is defined as follows:

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\left[\sum_{h=0}^{H} \gamma^{h}\, r_{t+h}\right]$$

wherein $Q^{\pi}$ represents the cost function of the strategy π; H represents the future prediction horizon; γ represents a reward discount factor between 0 and 1, used to reduce the influence of distant rewards; h represents the decision step index, a positive integer; $\mathbb{E}$ represents the expectation; and $r_t$ represents the reward value under different driving states.
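The definition above is the expected discounted sum of instant rewards over the horizon H. A minimal Monte-Carlo sketch (with illustrative reward sequences and an assumed γ = 0.9) estimates it by averaging discounted returns over sampled trajectories:

```python
# Monte-Carlo estimate of Q_pi = E[ sum_{h=0}^{H} gamma^h * r_{t+h} ],
# averaging discounted returns over sampled reward trajectories (illustrative data).
def discounted_return(rewards, gamma):
    return sum((gamma ** h) * r for h, r in enumerate(rewards))

def estimate_q(trajectories, gamma=0.9):
    returns = [discounted_return(tr, gamma) for tr in trajectories]
    return sum(returns) / len(returns)

# Two sampled reward sequences of length H + 1 = 3.
trajs = [[1.0, 1.0, 1.0], [1.0, 0.0, 1.0]]
print(round(estimate_q(trajs), 3))  # 2.26
```

The discount γ down-weights distant rewards exactly as in the definition: the reward at step h contributes with weight γ^h.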
2.2 Estimate a cost function of the interpretable decision;
the method specifically comprises the following steps: and (3) obtaining the value function of the interpretable decision by constructing a data set and adopting a recursion method through the data set.
The data set is composed of data elements; the data element is the next moment state obtained by adopting the interpretable driving strategy under different states.
For example, construct data elements $\tau_r = \{s_1,\ a_1 = \pi_r(s_1),\ s_2\}$ forming a data set $\mathcal{D}_r$. In the data set, each data element records the next-time-step state obtained by applying the interpretable driving strategy in a given state. The following cost-function evaluation process is obtained from the data set by recursion:

$$Q_r(s_t, \pi_r(s_t)) \leftarrow Q_r(s_t, \pi_r(s_t)) + \alpha\left[r(s_{t+1}) + \gamma\, Q_r(s_{t+1}, \pi_r(s_{t+1})) - Q_r(s_t, \pi_r(s_t))\right]$$

where α is the manually designed learning rate. The recursion relies only on the current state, the interpretable policy, and the next-time-step state.
Alternatively, the data set is obtained by driving the vehicle directly with the interpretable decision and collecting driving data during driving. When enough data has been collected, a more accurate cost function of the interpretable decision can be obtained.
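The recursive evaluation described above can be sketched as a small TD-style loop over the transition data set; the two-state chain, rewards and hyperparameters below are illustrative assumptions:

```python
# Recursive (TD-style) evaluation of the interpretable decision's cost function Q_r
# from a data set of (s, a = pi_r(s), s') transitions, per the update
#   Q_r(s) <- Q_r(s) + alpha * [ r(s') + gamma * Q_r(s') - Q_r(s) ].
from collections import defaultdict

def evaluate_pi_r(dataset, reward, alpha=0.5, gamma=0.9, sweeps=200):
    Q_r = defaultdict(float)           # unvisited states default to value 0
    for _ in range(sweeps):            # repeated recursion until values settle
        for s, s_next in dataset:
            td_target = reward[s_next] + gamma * Q_r[s_next]
            Q_r[s] += alpha * (td_target - Q_r[s])
    return dict(Q_r)

# Tiny two-state chain: s0 -> s1 -> goal; only reaching "goal" pays 1.
dataset = [("s0", "s1"), ("s1", "goal")]
reward = {"s1": 0.0, "goal": 1.0}
Q_r = evaluate_pi_r(dataset, reward)
print(round(Q_r["s1"], 3))  # 1.0: one step from the rewarded state
print(round(Q_r["s0"], 3))  # 0.9: discounted once by gamma
```

As the text notes, each update touches only the current state, the interpretable policy's transition, and the next-time-step state.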
2.3 According to the interpretable decision value function and the set decision value function, learning to obtain a high-value decision value function.
Since the driving strategy of the learning-based decision is itself changing, its cost function changes accordingly. During learning-based training, the decision cost function must therefore be estimated at the same time, and the decision value is improved by adjusting the decision. In this embodiment, the learning-based decision training process generates driving actions in different driving states and then adjusts the actions according to environmental feedback, thereby improving decision performance. As shown in fig. 4, the method specifically includes:
when the autonomous vehicle encounters a state it has not encountered before, it drives with the interpretable decision and initializes the cost function of the learning-based decision from the environment's feedback;
when the autonomous vehicle encounters a state it has encountered before, it generates a new action, and after the next-time-step state is obtained, updates the cost function of the learning-based decision according to the new action.
The new action a is generated as follows:

$$a = \arg\max_{a}\left[\,Q(s,a) + c\sqrt{\frac{\ln N(s)}{N(s,a)}} + \delta(s,a,\pi_r)\right]$$

where N(s) denotes the number of times the current state s has been encountered, N(s,a) denotes the number of times action a has been taken in the current state, Q(s,a) denotes the decision cost function of taking action a in state s, δ(s,a,π_r) is the interpretable decision-inducing value, π_r denotes the interpretable decision, and c is a manually adjusted constant.
The interpretable decision-inducing value δ(s,a,π_r) is defined as follows:

$$\delta(s,a,\pi_r) = \begin{cases} c_{thres}, & a = \pi_r(s) \\ 0, & \text{otherwise} \end{cases}$$

wherein $c_{thres}$ is a manually designed positive number: when the action is the same as the interpretable decision, the inducing value is $c_{thres}$; otherwise it is 0. $c_{thres}$ determines how conservative the reliable decision is, as shown in fig. 5: as $c_{thres}$ tends to infinity, the output of the learning-based decision becomes exactly the same as the interpretable decision; as $c_{thres}$ tends to zero, the strategy generated by the learning-based decision becomes independent of the interpretable decision, and the learning-based decision loses its reliability. $c_{thres}$ is therefore generally designed to be a value slightly greater than 0.
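Combining the learned value Q(s,a), a visit-count exploration bonus, and the inducing value δ, the exploratory action choice can be sketched as follows; the square-root bonus form, c, c_thres and all the numbers are illustrative assumptions:

```python
import math

# Exploratory action selection: learned value Q(s,a) + a visit-count bonus
# + the interpretable decision-inducing value delta(s, a, pi_r).
def delta(s, a, pi_r, c_thres=0.1):
    """Inducing value: c_thres when the action matches the interpretable decision."""
    return c_thres if a == pi_r(s) else 0.0

def select_action(s, actions, Q, N_s, N_sa, pi_r, c=1.0, c_thres=0.1):
    def score(a):
        bonus = c * math.sqrt(math.log(N_s[s]) / N_sa[(s, a)])  # favors rarely tried actions
        return Q[(s, a)] + bonus + delta(s, a, pi_r, c_thres)
    return max(actions, key=score)

pi_r = lambda s: "keep"   # interpretable decision always keeps speed here
Q = {("s0", "keep"): 0.50, ("s0", "accelerate"): 0.55}
N_s = {"s0": 100}
N_sa = {("s0", "keep"): 50, ("s0", "accelerate"): 50}
# Without delta, "accelerate" would win; the small c_thres tips the choice to "keep".
print(select_action("s0", ["keep", "accelerate"], Q, N_s, N_sa, pi_r))  # keep
```

Raising c_thres makes the output track π_r more closely (more conservative); letting it shrink toward zero decouples the learned policy from π_r, matching the trade-off described above.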
After the next-time-step state is obtained, the cost function is updated as follows:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r(s_{t+1}) + \gamma\, Q(s_{t+1}, a) - Q(s_t, a_t)\right] \tag{5}$$
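A single application of update (5) might look like the following sketch; the states, reward and step sizes are illustrative assumptions:

```python
# One application of update (5):
#   Q(s_t, a_t) <- Q(s_t, a_t) + alpha * [ r(s_{t+1}) + gamma * Q(s_{t+1}, a) - Q(s_t, a_t) ]
def q_update(Q, s_t, a_t, r_next, s_next, a_next, alpha=0.1, gamma=0.9):
    td_error = r_next + gamma * Q.get((s_next, a_next), 0.0) - Q.get((s_t, a_t), 0.0)
    Q[(s_t, a_t)] = Q.get((s_t, a_t), 0.0) + alpha * td_error
    return Q

Q = {("s0", "keep"): 0.0, ("s1", "keep"): 1.0}
q_update(Q, "s0", "keep", r_next=0.5, s_next="s1", a_next="keep")
print(round(Q[("s0", "keep")], 3))  # 0.1 * (0.5 + 0.9 * 1.0) = 0.14
```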
in the above embodiments, after obtaining the cost function capable of explaining the decision and the learning decision, the method for generating the action of the autonomous vehicle during driving is as follows:
Figure BDA0003321172540000071
wherein, Q(s) t A) a cost function, Q, representing different actions generated by a learning-type decision r (s tr (s t ) A cost function representing an interpretable decision.
In step 3), the final action is selected by judging whether the value of the learning-based decision is higher than that of the interpretable decision: if so, the learning-based decision is chosen; if not, the interpretable decision is chosen. The driving decision formed by this mechanism is reliable.
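The selection mechanism of step 3) can be sketched as follows, assuming the learned cost function Q and the interpretable cost function Q_r have already been estimated (all values illustrative):

```python
# Final action selection: take the learning-based action only when its value
# is at least the interpretable decision's value; otherwise fall back to pi_r.
def final_action(s, actions, Q, Q_r, pi_r):
    a_learn = max(actions, key=lambda a: Q.get((s, a), float("-inf")))
    if Q.get((s, a_learn), float("-inf")) >= Q_r[(s, pi_r(s))]:
        return a_learn      # learned decision is at least as valuable
    return pi_r(s)          # interpretable decision bounds the performance

pi_r = lambda s: "keep"
Q = {("s0", "keep"): 0.4, ("s0", "swerve"): 0.6}
Q_r = {("s0", "keep"): 0.7}
print(final_action("s0", ["keep", "swerve"], Q, Q_r, pi_r))  # keep (fallback)
```

The fallback branch is what gives the mechanism its reliability guarantee: the output's value never drops below that of the interpretable decision's own action.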
In this embodiment, after the interpretable decision is adjusted, only its cost function needs to be re-estimated, and the method can be applied to the already-trained learning-based decision, thereby achieving reliable learning-based decision-making for the autonomous vehicle.
In summary, a reliable automated driving learning-based decision method designed at the level of the learning decision mechanism is one of the effective ways to raise the intelligence level of autonomous vehicles and thereby achieve reliable automated driving in complex scenarios, promoting the development of autonomous vehicles.
In one embodiment of the present invention, there is provided a trustworthy learning automatic driving decision system, comprising: the system comprises an interpretable decision construction module, a learning type decision training module and an output module;
the interpretable decision building module builds an interpretable decision based on a preset decision problem, and the interpretable decision guides learning type decision training;
the learning decision training module trains the learning decision through a decision problem to obtain a learning decision of a decision value function with high value;
and the output module selects a high-value decision from the learning decision and the interpretability decision as a final reliable learning decision action.
The system provided in this embodiment is used for executing the above method embodiments, and for details of the process and the details, reference is made to the above embodiments, which are not described herein again.
As shown in fig. 6, which is a schematic structural diagram of a computing device provided in an embodiment of the present invention, the computing device may be a terminal and may include: a processor, a communication interface, a memory, a display screen and an input device. The processor, the communication interface and the memory communicate with each other through a communication bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system and a computer program that, when executed by the processor, implements the decision method, and the internal memory provides an environment for running the operating system and the computer program on the non-volatile storage medium. The communication interface performs wired or wireless communication with an external terminal; the wireless communication can be realized through WIFI, a carrier network, NFC (near field communication) or other technologies. The display screen can be a liquid crystal or electronic ink display screen, and the input device can be a touch layer covering the display screen, a key, trackball or touchpad arranged on the housing of the computing device, or an external keyboard, touchpad or mouse. The processor may call logic instructions in the memory to perform the following method:
constructing an interpretable decision based on a preset decision problem, and guiding learning type decision training by the interpretable decision; training the learning type decision by the decision problem to obtain the learning type decision of a decision value function with high value; and selecting a high-value decision from the learning decision and the interpretable decision as a final reliable learning decision action.
In addition, the logic instructions in the memory may be implemented in the form of software functional units and may be stored in a computer readable storage medium when sold or used as a stand-alone product. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects may be applied, and that a particular computing device may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In one embodiment of the invention, a computer program product is provided, the computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the method provided by the above-described method embodiments, for example, comprising: constructing an interpretable decision based on a preset decision problem, and guiding learning type decision training by the interpretable decision; training the learning type decision by the decision problem to obtain the learning type decision of a decision value function with high value; and selecting the decision with high value in the learning decision and the interpretability decision as the final reliable learning decision action.
In one embodiment of the invention, a non-transitory computer-readable storage medium is provided, which stores server instructions that cause a computer to perform the methods provided by the above embodiments, for example, including: constructing an interpretable decision based on a preset decision problem, and guiding learning type decision training by the interpretable decision; training the learning type decision by the decision problem to obtain the learning type decision of a decision value function with high value; and selecting a high-value decision from the learning decision and the interpretable decision as a final reliable learning decision action.
The implementation principle and technical effect of the computer-readable storage medium provided by the above embodiments are similar to those of the above method embodiments, and are not described herein again.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above examples are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features equivalently replaced, and that such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (8)

1. A method for reliable learning-based automated driving decision making, comprising:
constructing an interpretable decision based on a preset decision problem, the interpretable decision guiding the training of the learning-based decision;
training the learning-based decision on the decision problem to obtain a learning-based decision with a high-value decision value function;
selecting the higher-value decision between the learning-based decision and the interpretable decision as the final reliable learning-based decision action;
wherein the decision problem consists of three elements: the environment observation state, the automatic driving action, and the instant reward;
the construction of the interpretable decision is as follows: an interpretable decision method is used to guarantee a performance lower bound for the learning-based decision, requiring that the final learning-based decision perform no worse than the interpretable decision; the interpretable decision is constructed in the form:
π_r : S → A
a_r = π_r(s)
wherein S represents the space formed by all possible environment observation states; A represents the space formed by all possible decision actions; π_r represents the interpretable decision, a mapping from the state space to the action space; a_r represents the action output by the interpretable decision; and s is a state;
the construction method of the interpretable decision constrains only its input and output forms, which must be consistent with the decision problem of the learning-based decision;
training the learning-based decision on the decision problem comprises:
setting a decision value function;
estimating the value function of the interpretable decision;
and learning a high-value decision value function from the value function of the interpretable decision and the set decision value function.
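As an illustration of the constrained form in claim 1, a minimal rule-based π_r might map an observation state to a longitudinal acceleration. Everything concrete below (the state fields, the constant-time-headway rule, the gains, and the acceleration limits) is our assumption; the claim requires only that π_r share the learning-based decision's state and action spaces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    # Hypothetical environment observation: ego speed, gap to the lead
    # vehicle, and lead-vehicle speed (all fields are illustrative).
    ego_speed: float   # m/s
    gap: float         # m
    lead_speed: float  # m/s

def interpretable_policy(s: State) -> float:
    """A rule-based pi_r: maps a state to a longitudinal acceleration.

    Constant-time-headway car-following rule -- purely illustrative; the
    patent constrains only the input/output form of pi_r.
    """
    desired_gap = 2.0 + 1.5 * s.ego_speed        # standstill gap + headway
    gap_error = s.gap - desired_gap
    speed_error = s.lead_speed - s.ego_speed
    accel = 0.2 * gap_error + 0.6 * speed_error  # proportional feedback
    return max(-3.0, min(1.5, accel))            # comfort/safety limits
```

Here the action space is a bounded scalar acceleration; a discrete action space (keep lane, brake, change lane) would satisfy the same constraint.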
2. The reliable learning-based automated driving decision method of claim 1, wherein estimating the value function of the interpretable decision comprises: constructing a data set and obtaining the value function of the interpretable decision from the data set by a recursive method.
3. The reliable learning-based automated driving decision method of claim 2, wherein the data set consists of data elements, each data element being the next-moment state obtained by applying the interpretable driving strategy in a given state;
alternatively, the data set is obtained by driving the vehicle directly with the interpretable decision and collecting driving data while the vehicle drives.
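A sketch of the recursive value estimate of claims 2 and 3, assuming episodes logged while the interpretable policy drives. Returns are accumulated backward from the episode end and averaged per state; the trajectory format and discount factor are our assumptions, as the claims specify only a recursive estimate from such a data set.

```python
def estimate_interpretable_values(episodes, gamma=0.95):
    """Estimate V_{pi_r}(s) from logged rollouts of the interpretable policy.

    `episodes` is a list of trajectories, each a list of (state, reward)
    pairs collected while pi_r drove the vehicle. Values are computed by
    the recursion V(s_t) = r_t + gamma * V(s_{t+1}), evaluated backward
    from the episode end, and averaged over visits to each state.
    """
    sums, counts = {}, {}
    for episode in episodes:
        ret = 0.0
        for state, reward in reversed(episode):  # recursion from the end
            ret = reward + gamma * ret
            sums[state] = sums.get(state, 0.0) + ret
            counts[state] = counts.get(state, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}
```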
4. The reliable learning-based automated driving decision method of claim 1, wherein learning a high-value decision value function from the value function of the interpretable decision and the set decision value function comprises:
when the automated driving vehicle encounters a state it has not encountered before, driving with the interpretable decision and initializing the value function of the learning-based decision according to the environment's feedback;
when the automated driving vehicle encounters a previously encountered state, generating a new action; after the next-moment state is obtained, updating the value function of the learning-based decision according to the new action.
5. The reliable learning-based automated driving decision method of claim 4, wherein the new action a is:
a = argmax_a [ Q(s, a) + c · sqrt( ln N(s) / N(s, a) ) + δ(s, a, π_r) ]
where N(s) denotes the number of times the current state has been encountered, N(s, a) the number of times action a has been taken in the current state, Q(s, a) the decision value function for taking action a in state s, δ(s, a, π_r) the interpretable-decision induction value, π_r the interpretable decision, and c a manually tuned constant.
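A hedged sketch of the action selection of claim 5: each candidate action is scored by Q(s, a), a UCB-style exploration term built from N(s) and N(s, a), and an induction term δ(s, a, π_r) that favors agreement with the interpretable decision. The exact shapes of the exploration and induction terms are our assumptions; the claim names only the quantities involved.

```python
import math

def select_action(s, actions, Q, N_s, N_sa, pi_r, c=1.0, bonus=1.0):
    """Score candidate actions and return the argmax, per claim 5.

    Q maps (s, a) -> estimated value, N_s counts state visits, N_sa counts
    (state, action) visits, and pi_r is the interpretable policy. The
    exploration term shrinks as an action is tried more often; the
    delta(s, a, pi_r) term here is a fixed bonus for agreeing with pi_r.
    """
    def score(a):
        # UCB-style exploration bonus (the +1 avoids log(0)/division by 0).
        explore = c * math.sqrt(math.log(N_s.get(s, 0) + 1) / (N_sa.get((s, a), 0) + 1))
        induce = bonus if a == pi_r(s) else 0.0  # delta(s, a, pi_r)
        return Q.get((s, a), 0.0) + explore + induce
    return max(actions, key=score)
```

With all counts at zero the interpretable action wins via the induction term; once it has been sampled heavily, the exploration term can steer the search toward under-tried alternatives.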
6. A reliable learning-based automated driving decision system, comprising: an interpretable decision construction module, a learning-based decision training module, and an output module;
the interpretable decision construction module constructs an interpretable decision based on a preset decision problem, the interpretable decision guiding the training of the learning-based decision;
the learning-based decision training module trains a learning-based decision on the decision problem to obtain a learning-based decision with a high-value decision value function;
the output module selects the higher-value decision between the learning-based decision and the interpretable decision as the final reliable learning-based decision action;
wherein the decision problem consists of three elements: the environment observation state, the automatic driving action, and the instant reward;
the construction of the interpretable decision is as follows: an interpretable decision method is used to guarantee a performance lower bound for the learning-based decision, requiring that the final learning-based decision perform no worse than the interpretable decision; the interpretable decision is constructed in the form:
π_r : S → A
a_r = π_r(s)
wherein S represents the space formed by all possible environment observation states; A represents the space formed by all possible decision actions; π_r represents the interpretable decision, a mapping from the state space to the action space; a_r represents the action output by the interpretable decision; and s is a state;
the construction method of the interpretable decision constrains only its input and output forms, which must be consistent with the decision problem of the learning-based decision;
training the learning-based decision on the decision problem comprises:
setting a decision value function;
estimating the value function of the interpretable decision;
and learning a high-value decision value function from the value function of the interpretable decision and the set decision value function.
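The output module of claim 6 can be sketched as a value comparison between the two policies. The names q_learn and v_interp stand in for the trained decision value function and the estimated interpretable value function; the signatures are illustrative, not the patent's API.

```python
def reliable_decision(s, pi_learn, q_learn, pi_r, v_interp):
    """Output module: emit the higher-valued of the two decisions.

    pi_learn / pi_r are the learning-based and interpretable policies;
    q_learn(s, a) is the learned decision value function and v_interp(s)
    the estimated value of the interpretable policy in state s. This
    realizes the performance lower bound: the emitted action is never
    valued below the interpretable decision's.
    """
    a_learn = pi_learn(s)
    if q_learn(s, a_learn) >= v_interp(s):
        return a_learn  # learned decision is at least as valuable
    return pi_r(s)      # otherwise fall back to the interpretable decision
```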
7. A computer readable storage medium storing one or more programs, wherein the one or more programs comprise instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-5.
8. A computing device, comprising: one or more processors, memory, and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods of claims 1-5.
CN202111246972.5A 2021-10-26 2021-10-26 Reliable learning type automatic driving decision-making method, system, storage medium and equipment Active CN113879323B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111246972.5A CN113879323B (en) 2021-10-26 2021-10-26 Reliable learning type automatic driving decision-making method, system, storage medium and equipment


Publications (2)

Publication Number Publication Date
CN113879323A CN113879323A (en) 2022-01-04
CN113879323B true CN113879323B (en) 2023-03-14

Family

ID=79014396

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111246972.5A Active CN113879323B (en) 2021-10-26 2021-10-26 Reliable learning type automatic driving decision-making method, system, storage medium and equipment

Country Status (1)

Country Link
CN (1) CN113879323B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180090660A (en) * 2017-02-03 2018-08-13 자동차부품연구원 Parameter learning apparatus and method for personalization of autonomous driving control system
CN110297484A (en) * 2018-03-23 2019-10-01 广州汽车集团股份有限公司 Unmanned control method, device, computer equipment and storage medium
CN111273668A (en) * 2020-02-18 2020-06-12 福州大学 Unmanned vehicle motion track planning system and method for structured road
CN111874007A (en) * 2020-08-06 2020-11-03 中国科学院自动化研究所 Knowledge and data drive-based unmanned vehicle hierarchical decision method, system and device
CN111907527A (en) * 2019-05-08 2020-11-10 通用汽车环球科技运作有限责任公司 Interpretable learning system and method for autonomous driving
CN113120003A (en) * 2021-05-18 2021-07-16 同济大学 Unmanned vehicle motion behavior decision method
CN113264043A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned driving layered motion decision control method based on deep reinforcement learning
CN113264059A (en) * 2021-05-17 2021-08-17 北京工业大学 Unmanned vehicle motion decision control method supporting multiple driving behaviors and based on deep reinforcement learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research on the Behavior Decision-Making System of Unmanned Vehicles; Xiong Lu et al.; Automobile Technology; 2018-08-03 (No. 08); full text *
Research Progress on the Interpretability of Deep Learning; Cheng Keyang et al.; Journal of Computer Research and Development; 2020-06-07 (No. 06); full text *


Similar Documents

Publication Publication Date Title
AU2019253703B2 (en) Improving the safety of reinforcement learning models
CN108009587B (en) Method and equipment for determining driving strategy based on reinforcement learning and rules
WO2019199878A1 (en) Analysis of scenarios for controlling vehicle operations
CN112997128B (en) Method, device and system for generating automatic driving scene
AU2019251365A1 (en) Dynamically controlling sensor behavior
US20190310632A1 (en) Utility decomposition with deep corrections
US11810460B2 (en) Automatic generation of pedestrians in virtual simulation of roadway intersections
Shi et al. Offline reinforcement learning for autonomous driving with safety and exploration enhancement
Li et al. Modeling mixed traffic flows of human-driving vehicles and connected and autonomous vehicles considering human drivers’ cognitive characteristics and driving behavior interaction
CN113253612B (en) Automatic driving control method, device, equipment and readable storage medium
CN115777088A (en) Vehicle operation safety model test system
CN113879323B (en) Reliable learning type automatic driving decision-making method, system, storage medium and equipment
CN117406756B (en) Method, device, equipment and storage medium for determining motion trail parameters
CN112835362B (en) Automatic lane change planning method and device, electronic equipment and storage medium
Wang et al. Autonomous driving based on approximate safe action
Liang et al. Wip: End-to-end analysis of adversarial attacks to automated lane centering systems
US20220261630A1 (en) Leveraging dynamical priors for symbolic mappings in safe reinforcement learning
Chen et al. Investigation of a driver-oriented adaptive cruise control system
CN114616157A (en) Method and system for checking automated driving functions by reinforcement learning
Shu et al. Generative Models and Connected and Automated Vehicles: A Survey in Exploring the Intersection of Transportation and AI
Yang et al. Deep Reinforcement Learning Lane-Changing Decision Algorithm for Intelligent Vehicles Combining LSTM Trajectory Prediction
CN112766310B (en) Fuel-saving lane-changing decision-making method and system
US20230365157A1 (en) Driving related augmented virtual fields
US20230356747A1 (en) Driving related augmented virtual fields
CN114506337B (en) Method and system for determining a maneuver to be performed by an autonomous vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant