CN107479547B

CN107479547B - Decision tree behavior decision algorithm based on teaching learning

Info

Publication number: CN107479547B
Application number: CN201710687194.0A
Authority: CN
Inventors: 王祝萍; 邢文治; 张皓; 陈启军
Original assignee: Tongji University
Current assignee: Tongji University
Priority date: 2017-08-11
Filing date: 2017-08-11
Publication date: 2020-11-24
Anticipated expiration: 2037-08-11
Also published as: CN107479547A

Abstract

The invention discloses a decision tree behavior decision algorithm based on teaching learning, which mainly solves the problem that the existing decision algorithm in the prior art can not simultaneously meet the requirements of comprehensive complex scenes and stability. The decision tree behavior decision algorithm based on teaching learning comprises the following steps: storing a state transition rule of the teaching track; obtaining a state transition frequency matrix and a state transition probability matrix; constructing a reward; the decision tree evaluates the actions to be generated; updating a transition frequency matrix and a state transition probability matrix; the above procedure was repeated until the evaluation passed. Through the scheme, the invention achieves the purposes of maximum reasonability and safety of unmanned driving behavior decision.

Description

Decision tree behavior decision algorithm based on teaching learning

Technical Field

The invention relates to the field of unmanned driving, in particular to a decision tree behavior decision algorithm based on teaching learning.

Background

An unmanned vehicle is a high-level form of a mobile robot with autonomous driving capability. The intelligent computing system can realize three functions of environment perception, decision planning and motion control. Compared with other small mobile robots, the system is complex in structure. Besides basic mobile driving capability, the system can perform real-time data fusion and positioning by using various sensors such as radar and camera in cooperation with a special high-precision map, so as to realize perception and understanding of the current environment. Meanwhile, according to the road and moving obstacle information understood by the sensor, the vehicle uses a decision planning algorithm to cut out a reasonable and feasible expected track, and the control module carries out final vehicle moving behavior implementation. The whole intelligent computing system comprises important key technologies such as lane line detection, obstacle identification, high-precision maps, high-precision positioning, decision planning algorithms, controller design and the like, relates to numerous disciplinary knowledge, and has extremely high theoretical research significance and engineering practice value.

The field of unmanned vehicle research includes three directions of environment perception, behavior decision and planning control. The behavior decision is used as a central position for connecting environment perception and planning control, has a very important position, and has become a key point and a difficulty point of research in the field of unmanned driving. The behavior decision is the process of selecting the best scheme which meets the purpose of self behavior from several feasible schemes selectable under the current environment. In this process, a specific decision algorithm is often needed to perform prediction evaluation on the result state after the action is taken, and the best action is selected under the unified judgment standard. For the unmanned vehicle, the behavior decision needs to acquire perception and understanding of the external environment according to data information fused by sensors such as a current radar and a camera, reasonably predict the next behavior to be executed by the vehicle, transmit the selectable behavior to a planning control system in a physical value form according to a decision algorithm, and further realize the expected behavior of a decision module so as to realize unmanned autonomous driving of the vehicle.

The behavior decision theory appears in the fields of psychology, management and economics at first, and is gradually expanded to be applied to other directions later. Currently, behavior decisions regarding vehicles are mainly focused on traditional empirical methods such as finite state machines, decision trees, multi-attribute decisions, and learning-based prediction methods. Experience-based design methods cannot be extended to synthesize complex scenes; although the learning prediction-based method has stability and safety which are difficult to determine for behaviors, the adaptability to scenes is far better than that of the experience-based design method. In view of the development of unmanned driving, the problem of complexity and variability of scenes is necessarily faced, and a learning prediction-based method becomes the best option for realizing vehicle behavior decision. Teaching learning as a learning prediction-based method can effectively solve the expansibility of a scene, and is an efficient behavior decision solution.

In practical applications, teaching learning alone as part of the decision-making of unmanned behavior does not solve this problem. The decision of the driverless behavior should ensure the maximum rationality of the behavior. The common teaching and learning is to carry out probability modeling on the behavior decision of unmanned driving theoretically, and the unreasonable behavior is difficult to avoid to the greatest extent from the practical problem. In addition, the data of the teaching part does not completely cover the global space. The teaching data provides somewhat less a priori decision knowledge. For the decision-making problem of the unmanned behavior, the decision-making system needs to be able to continuously strengthen the updating strategy on the basis of the prior knowledge.

Disclosure of Invention

The invention aims to provide a decision tree behavior decision algorithm based on teaching learning, which aims to solve the problem that the unreasonable behavior is difficult to avoid to the greatest extent from the practical problem in the conventional decision algorithm.

In order to solve the above problems, the present invention provides the following technical solutions:

a decision tree behavior decision algorithm based on teaching learning comprises the following steps:

(a) describing a teaching rule in teaching learning by using a state transition frequency matrix and a state transition probability matrix of the behavior, and storing the state transition rule of a teaching track;

(b) obtaining a state transition frequency matrix and a state transition probability matrix according to the step (a);

(c) constructing a reward according to the state transition frequency;

(d) when the transition probability matrix outputs the selection action to be performed, the decision tree evaluates the action to be generated by the state transition probability matrix according to the step (b), if the evaluation is passed, the state transition is executed, and if the evaluation is not passed, the step (e) is executed;

(e) updating a transition frequency matrix and a state transition probability matrix through an Actor-Critic algorithm according to the steps (b) and (c);

(f) repeating steps (d) and (e) until the evaluation is passed.

Specifically, the specific process of step (a) is as follows: firstly, rasterizing the length of a predicted road surface; designing a state transition table for recording a conversion relation; and filling the frequency of the transition table in a matrix form, wherein the frequency is used as the frequency of transition from the current state to the subsequent state in the teaching, and the state transition probability is obtained by calculating the access frequency of the subsequent n possible states of the current state through a softmax function.

Specifically, the specific process of step (b) is as follows: the state transition frequency is the number of times of the state to be accessed in the current state, and the state transition probability is the transition probability value obtained by calculating the number of times; discretizing and sampling the state transition track of the teaching learning to construct a state transition frequency matrix, wherein the state transition probability is obtained by calculating the access frequency of the subsequent n possible states of the current state through a softmax function.

Specifically, the specific process of step (c) is as follows: comparing the state action to be performed with the expected state action; if the result meets the expectation, the reward is added, otherwise, the negative reward punishment is carried out; if the behavior which is closer to the expected action than the selected action appears in other unselected actions in the current state, carrying out reward point adding; finally, fitting the discrete state points to obtain a planning curve; wherein the variation expression of the reward is designed as follows:

the above equation indicates that when the action is as desired, Δ r may be set to + 1; conversely, when the action is not desired, Δ r may be set to-1, where a_uIs the desired action, and a is the action to be performed.

Specifically, the specific process of step (d) is: the decision tree judges the reasonability and safety of action transfer through two aspects; if all the evaluation is satisfied, the evaluation is passed, otherwise, the evaluation is not passed;

firstly, judging the reasonability of state transition to confirm that the vehicle can realize the transition under the condition of self physical condition limitation; the evaluation procedure is s_i→s_j,||i-j||＝1；

In the above formula s_iRepresents the ith state; the formula shows that the vehicle can select a transition state in the adjacent state of the current state every time the vehicle moves;

secondly, after the track points are fitted, expansion is carried out, and no other obstacles exist in the track travelable area:

wherein

Is the abscissa, xo, of the state si relative to the vehicle_bstacle,yo_bstacleObstacle horizontal and vertical coordinates, x, of adjacent area_width,y _length1/2 for the width and length of the vehicle, respectively.

Specifically, the specific process of step (e) is: the strengthening method comprises the following steps:

_t＝r_t+γV(s_t+1)-V(s_t),p(s_t,a_t):＝p(s_t,a_t)+β_t

wherein r is_tAwarding immediately; v(s)_t) Is the predicted jackpot for the current state, V(s)_t+1) Is the jackpot after prediction from the next state, β is the update degree, γ is the reward confidence after the current prediction, p(s)_t,a_t) Is in a state s_tPerforming action a_tThe formula (2) is updated based on transition probabilities obtained by teaching learned transition frequencies.

Compared with the prior art, the invention has the following beneficial effects: the decision tree algorithm is located in the middle position, and the state transition rule is carried upwards and is connected downwards to strengthen or correct the state transition rule. For the teaching rule of a human driver, the method defines two matrixes of state transition frequency and state transition probability for description. The state transition frequency is the number of times of the state to be accessed in the current state, and the state transition probability is the transition probability value obtained by calculating such number of times. When the transition probability outputs the selection action to be performed, the decision tree algorithm needs to check and evaluate the rationality or safety of the current action. After the decision tree evaluation, the algorithm can correct the current state transition frequency matrix, increase the frequency of reasonable actions and reduce the frequency of unreasonable actions. The corrected state transition frequency matrix can continuously calculate the corresponding transition probability so as to carry out reciprocating cycle reinforcement; the maximum reasonability and safety of unmanned behavior decision are ensured.

Drawings

Fig. 1 is a diagram of expert teach lane access in the present invention.

FIG. 2 is a graph of the recovery results of the present invention.

FIG. 3 is a first partial experimental data recovery fit.

FIG. 4 is a second partial experimental data recovery fit.

Fig. 5 is a third graph of the recovery fit of a portion of experimental data.

Fig. 6 is a graph of a partial experimental data recovery fit.

Detailed Description

The present invention is further illustrated by the following figures and examples, which include, but are not limited to, the following examples.

In the whole algorithm frame, the decision tree algorithm is in the middle position, the state transition rule is carried upwards, and the state transition rule is strengthened or corrected downwards in a connected mode. For the teaching rule of a human driver, the method defines two matrixes of state transition frequency and state transition probability for description. The state transition frequency is the number of times of the state to be accessed in the current state, and the state transition probability is the transition probability value obtained by calculating such number of times. When the transition probability outputs the selection action to be performed, the decision tree algorithm needs to check and evaluate the rationality or safety of the current action. After the decision tree evaluation, the algorithm can correct the current state transition frequency matrix, increase the frequency of reasonable actions and reduce the frequency of unreasonable actions. The corrected state transition frequency matrix can continuously calculate the corresponding transition probability so as to carry out reciprocating cycle reinforcement; the specific process is as follows:

the decision tree behavior decision algorithm based on teaching learning comprises the following steps:

firstly, rasterizing the length of a predicted road surface; designing a state transition table for recording a conversion relation; filling the frequency of a transition table in a matrix form, and taking the frequency as the frequency of transition from the current state to the subsequent state in the teaching, wherein the state transition probability is obtained by calculating the access frequency of n subsequent possible states of the current state through a softmax function;

the state transition frequency is the number of times of the state to be accessed in the current state, and the state transition probability is the transition probability value obtained by calculating the number of times; discretizing and sampling the state transition track of the teaching learning to construct a state transition frequency matrix, wherein the state transition probability is obtained by calculating the access frequency of the subsequent n possible states of the current state through a softmax function.

(c) Constructing a reward according to the state transition frequency;

comparing the state action to be performed with the expected state action; if the result meets the expectation, the reward is added, otherwise, the negative reward punishment is carried out; if the behavior which is closer to the expected action than the selected action appears in other unselected actions in the current state, carrying out reward point adding; finally, fitting the discrete state points to obtain a planning curve; wherein the variation expression of the reward is designed as follows:

the decision tree judges the reasonability and safety of action transfer through two aspects; if all the evaluation is satisfied, the evaluation is passed, otherwise, the evaluation is not passed;

wherein

Is the abscissa, x, of the state si with respect to the vehicle_obstacle,yo_bstacleObstacle horizontal and vertical coordinates, x, of adjacent area_width,y _length1/2 for the width and length of the vehicle, respectively. .

the strengthening method comprises the following steps:

_t＝r_t+γV(s_t+1)-V(s_t),p(s_t,a_t):＝p(s_t,a_t)+β_t

(f) Repeating steps (d) and (e) until the evaluation is passed.

The strategy updating part of the invention adopts an Actor-Critic algorithm. The Actor-criticic algorithm is a model-free algorithm that can be used in model-free cases as well as in model-present cases. The model-free solving algorithm is a major breakthrough in the Markov solving method, and moves the mathematical means with stronger theoretical performance to the application scene which is more in line with the actual concrete problem. Within the category of model-based algorithms, a common attribute is that the solution strategy needs to rely on existing prior transition models and reward structures. In contrast, a model-free solution strategy is not required. In general, it is very difficult to ideally model problems in life. It can be said that the complete markov process model is hidden in life and is difficult to make an obvious decision. From this point, the modeless algorithm is more suitable for solving the specific problem by simplifying the processing of the problem model constraint condition with the original theory. In the model-free algorithm, an agent can sample and acquire transition probability and other relevant variable information based on model definition through interaction to the environment, acquire prior cognition to the environment from a statistical view angle, and estimate a required reward function; alternatively, the agent uses a modeless algorithm to solve the reward function in a fuzzy manner to approximate the optimization objective. This is really a choice in two directions. After the first method is interactive with the environment to learn and obtain the transition probability and the reward model, the optimal strategy can be obtained by using a model-based solving method. The second direct approximation method is to solve the optimal strategy by a fuzzy means, and does not need any requirement on the form of the model. In many solution methods, there is an algorithm that combines both methods, and the reward learning is accelerated by using an approximate model while estimating the reward function, and both methods are iteratively updated. It should be noted that, in these model-free algorithms, the second direct approximation method is most concerned and the application range is also the widest; the following are simulation design and experimental design in the specific experimental process;

simulation design

In this simulation, the states need to be discretized into 27. The final state transition matrix size is. The probability matrix of the transition is calculated from the access frequency of the next 5 possible states of the current state.

The decision tree framework here detects the feasibility of the transition state. In this simulation, the decision tree will be detected as follows:

1. the lane number of the state jump is detected. If the difference in the values between lane numbers is greater than 2, it is indicated that the vehicle is facing a direct jump from the current leftmost lane to the rightmost lane. The algorithm sets the access frequency of the subsequent state after the state to 0. The algorithm continues to select a transferable state with the highest probability from the updated frequency matrix.

2. After passing the lane number detection, the algorithm needs to detect whether the left or right turn hits other obstacles. If other obstacles are hit, the access frequency of the subsequent state is reduced to half of the original one. The algorithm continues to select a transferable state with the highest probability from the updated frequency matrix.

3. After the above detection, the algorithm may perform a state transition.

FIGS. 1 and 2 show simulation results

Design of experiments

In this experiment, the teaching data was derived from sampling the vehicle travel trajectory. During the driving of the vehicle by the driver, the vehicle generally travels in the right lane. When meeting an obstacle, the vehicle changes lanes to avoid in a certain distance between the right lane and the right lane. There are 5 sampling states between the vehicle and the obstacle at the time of lane change. For such a sampling process, the detection tree and the enhancement process are as follows:

a transfer matrix is constructed by utilizing the discretized sampling state, and the transfer matrix is filled according to the sampling data;

1. calculating a transition probability matrix by using the obtained transition frequency matrix;

2. the rationality of the state jumps is detected by means of a detection tree. The state of the vehicle is not allowed to jump directly from the right lane to the left lane or from the left lane to the right lane.

3. Detecting whether the left turn or the right turn of the state can touch an obstacle or not by using detection;

4. for the jump of the state, the distance between the current state and the obstacle is calculated. And if the distance between the current state and the obstacle is calculated to be larger than the distance between the vehicle and the obstacle when the vehicle deflects the track in the teaching data and is a state distance, adding 1 to the access frequency of the subsequent adjacent state of the non-same lane of the state.

5. And updating the frequency matrix, and selecting the state with the highest probability for transfer.

6. And carrying out interpolation fitting on the discrete states.

Fig. 3 to 6 are experimental results.

The invention is well implemented in accordance with the above-described embodiments. It should be noted that, based on the above structural design, in order to solve the same technical problems, even if some insubstantial modifications or colorings are made on the present invention, the adopted technical solution is still the same as the present invention, and therefore, the technical solution should be within the protection scope of the present invention.

Claims

1. A decision tree behavior decision algorithm based on teaching learning is characterized by comprising the following steps:

(c) constructing a reward according to the state transition frequency;

(d) when the state transition probability matrix outputs the selection action to be performed, the decision tree evaluates the action to be generated by the state transition probability matrix according to the step (b), if the evaluation is passed, the state transition is executed, and if the evaluation is not passed, the step (e) is executed;

(e) updating a state transition frequency matrix and a state transition probability matrix through an Actor-Critic algorithm according to the steps (b) and (c);

(f) repeating steps (d) and (e) until the evaluation is passed.

2. The decision tree behavior decision algorithm based on teaching learning of claim 1, wherein the specific process of step (a) is as follows: firstly, rasterizing the length of a predicted road surface; designing a state transition table for recording a conversion relation; and filling the frequency of the state transition table in a matrix form, wherein the frequency is used as the frequency of transition from the current state to the subsequent state in the teaching, and the state transition probability is obtained by calculating the access frequency of the subsequent n possible states of the current state through a softmax function.

3. The decision tree behavior decision algorithm based on teaching learning of claim 1, wherein the specific process of step (b) is as follows: the state transition frequency is the number of times of the state to be accessed in the current state, and the state transition probability is the transition probability value obtained by calculating the number of times; discretizing and sampling the state transition track of the teaching learning to construct a state transition frequency matrix, wherein the state transition probability is obtained by calculating the access frequency of the subsequent n possible states of the current state through a softmax function.

4. The decision tree behavior decision algorithm based on teaching learning of claim 1, wherein the specific process of step (c) is as follows: comparing the state action to be performed with the expected state action; if the result meets the expectation, the reward is added, otherwise, the negative reward punishment is carried out; if the behavior which is closer to the expected action than the selected action appears in other unselected actions in the current state, carrying out reward point adding; finally, fitting the discrete state points to obtain a planning curve; wherein the variation expression of the reward is designed as follows:

the above equation indicates that when the motion is expected, Δ r is set to + 1; conversely, when the action is not desirable, setting Δ r to-1, where a_uIs a desired action, a is an action to be performed; Δ r denotes the execution of action a_uThe latter prize value.

5. The decision tree behavior decision algorithm based on teaching learning of claim 1, wherein the specific process of step (d) is as follows: the decision tree judges the reasonability and safety of action transfer through two aspects; if all the evaluation is satisfied, the evaluation is passed, otherwise, the evaluation is not passed;

In the above formula s_iRepresents the ith state; this equation indicates that each time the vehicle is moving, the transition state is selected in the vicinity of the current state, where s_iAnd s_jRespectively representing the states before and after a certain action is executed, and | i-j | ═ 1 is a constraint condition of the states; the value ranges of i and j are both natural numbers;

wherein

Is shape ofState s_iWith respect to the longitudinal and transverse coordinates, x, of the vehicle_obstacle,y_obstacleObstacle horizontal and vertical coordinates, x, of adjacent area_width,y_length1/2 for the width and length of the vehicle, respectively.

6. The teach-learning-based decision tree behavior decision algorithm according to claim 1, wherein the specific process of step (e) is; the strengthening method comprises the following steps:

_t＝r_t+γV(s_t+1)-V(s_t)，p(s_t，a_t)＝p(s_t，a_t)+β_t

wherein V(s)_t) Is the predicted jackpot for the current state, V(s)_t+1) Is the jackpot after prediction from the next state, β is the update degree, γ is the reward confidence after the current prediction, p(s)_t,a_t) Is in a state s_tPerforming action a_tThe formula (c) is updated on the basis of transition probabilities obtained by teaching the learned transition frequencies; wherein_tIs in slave state s_tTo s_t+1TD error of (r)_tIs a state s_tTo s_t+1Is awarded immediately.