CN108791302B - Driver behavior modeling system - Google Patents

Driver behavior modeling system

Info

Publication number
CN108791302B
CN108791302B (application CN201810662040.0A)
Authority
CN
China
Prior art keywords
driving
state
neural network
strategy
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810662040.0A
Other languages
Chinese (zh)
Other versions
CN108791302A (en)
Inventor
邹启杰
李昊宇
裴腾达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN201810662040.0A
Publication of CN108791302A
Application granted
Publication of CN108791302B
Legal status: Active
Anticipated expiration

Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001 Details of the control system
    • B60W2050/0019 Control system elements or transfer functions
    • B60W2050/0028 Mathematical models, e.g. for simulation
    • B60W2050/0029 Mathematical model of the driver

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Mechanical Engineering (AREA)
  • Evolutionary Computation (AREA)
  • Transportation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Automation & Control Theory (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a driver behavior modeling system comprising a feature extractor, a reward function generator, a driving strategy acquirer and a judger. The feature extractor extracts and constructs the features of the reward function; the reward function generator obtains the reward function required to construct the driving strategy; the driving strategy acquirer completes the construction of the driving strategy; and the judger checks whether the optimal driving strategy constructed by the acquirer meets the judgment criterion. If it does not, the reward function is reconstructed and the optimal driving strategy is constructed again, iterating until the criterion is met; finally, a driving strategy describing the real driving demonstration is obtained. The system can be applied to new, previously unseen scene states to obtain the corresponding actions, so the generalization capability of the resulting driver behavior model is greatly improved, the applicable scenes are wider, and the robustness is stronger.

Description

Driver behavior modeling system
Technical Field
The invention relates to a modeling method, in particular to a driver behavior modeling system.
Background
Autonomous driving is an important part of the intelligent transportation field. Given the current state of the technology, autonomous vehicles still require an intelligent driving system (intelligent driver-assistance system) and a human driver to cooperate in completing the driving task. In this process, driver modeling is an essential step, whether the goal is to better quantify driver information for the decision making of the intelligent system or to distinguish between drivers so as to provide personalized services.
Among current approaches to driver modeling, reinforcement learning handles well the complex sequential decision problems that arise when a driver operates a vehicle, with their large continuous state spaces and multiple optimization objectives, and is therefore an effective method for modeling driver behavior. As an MDP-based solution method, reinforcement learning requires interaction with the environment: actions are taken to obtain an evaluative feedback signal from the environment, i.e. a reward, and the long-term reward is maximized.
A survey of the existing literature shows that reward functions in driver behavior modeling are currently set in two main ways: the conventional approach, in which researchers set them manually for different scene states, and approaches based on inverse reinforcement learning. The conventional approach is highly subjective, and the quality of the reward function depends on the ability and experience of the researcher. Moreover, during driving a large number of decision variables must be balanced to set the reward function correctly; these variables are often incommensurable or even contradictory, and researchers frequently cannot design a reward function that balances all of the requirements.
Inverse reinforcement learning assigns appropriate weights to the various driving features with the help of driving demonstration data and can automatically learn the required reward function, overcoming the drawbacks of manual design. However, traditional inverse reinforcement learning can only learn from the scene states that already appear in the demonstration data, whereas in real driving the actual scenes often fall outside the demonstration range because of factors such as weather and scenery. Inverse reinforcement learning therefore only fits the relationship between scenes and decision actions present in the demonstration data and shows insufficient generalization capability.
Existing driver behavior modeling methods based on reinforcement learning follow two main ideas. In the first, a traditional reinforcement learning method is used and the reward function is set by hand: the researcher analyzes, organizes, screens and summarizes the scenes to obtain a series of features related to driving decisions, such as the distance to the vehicle ahead, whether the vehicle is near the curb, whether pedestrians are present, whether the speed is reasonable, and the lane-change frequency; a series of experiments is then designed according to the requirements of the driving scene to obtain the weight of each feature in the reward function for the corresponding environment, and the overall reward function is finally assembled as the model describing the driver's behavior. In the second, a probabilistic modeling approach, maximum entropy inverse reinforcement learning is used to solve for the driving behavior feature function. It is first assumed that some specific underlying probability distribution generates the demonstration trajectories; a distribution that fits the driving demonstration must then be found, and finding it can be cast as the nonlinear program
max_p −∑_τ p(τ) log p(τ)
s.t. ∑_τ p(τ) f(τ) = f̃_E (matching the demonstration feature expectation)
∑_τ p(τ) = 1
where p is the probability distribution over the demonstration trajectories. Solving this program yields the distribution and its parameters θ, i.e. the reward function r = θ^T f(s_t).
Traditional driver behavior models analyze, describe and reason about driving behavior from known driving data. The collected data, however, cannot cover the essentially unlimited variety of driving behavior, and the corresponding action cannot be obtained for every state. In a real driving scene the weather, scenery and surrounding objects differ, so the driving state has countless possibilities and it is impossible to traverse all states. Traditional driver behavior models therefore generalize poorly, rely on many assumptions, and have poor robustness.
Furthermore, in real driving problems, setting the reward function purely by hand requires balancing too many requirements across the various features; it depends entirely on the researcher's experience, requires repeated manual adjustment, is time-consuming and labor-intensive, and, most critically, is excessively subjective. Across different scenes and environments the researcher faces too many scene states, and even for a single scene state the relevant driving behavior features change as the requirements change; to describe the driving task accurately, a whole series of weights must be assigned to these factors. Among existing methods, inverse reinforcement learning based on a probabilistic model starts from the available demonstration data, seeks the distribution of that data, and derives the action selection for each state from that distribution. However, the distribution of the known data does not represent the distribution of all data; to obtain the distribution correctly, all states would have to be observed.
Disclosure of Invention
To solve the problem of weak generalization in driver modeling, namely the technical problem in the prior art that no corresponding reward function can be established for driver behavior modeling when a driving scene does not appear in the demonstration data, the present application provides a driver behavior modeling system that can be applied to new scene states to obtain the corresponding actions, so that the generalization capability of the resulting driver behavior model is greatly improved, the applicable scenes are wider, and the robustness is stronger.
To achieve this purpose, the technical scheme of the invention is as follows. The driver behavior modeling system specifically comprises:
a feature extractor, which extracts and constructs the features of the reward function;
a reward function generator, which obtains the reward function required to construct the driving strategy;
a driving strategy acquirer, which completes the construction of the driving strategy; and
a judger, which checks whether the optimal driving strategy constructed by the acquirer meets the judgment criterion; if it does not, the reward function is reconstructed and the optimal driving strategy is constructed again, iterating until the criterion is met; finally, a driving strategy describing the real driving demonstration is obtained.
Further, the feature extractor extracts and constructs the reward function features as follows:
S11, during driving, the video from a camera mounted behind the windshield of the vehicle is sampled to obtain N groups of pictures of road conditions in different driving environments; together with the corresponding driving operation data, namely the steering angles in those road environments, these pictures jointly constitute the training data;
S12, the collected pictures are translated, cropped and brightness-adjusted to simulate scenes with different illumination and weather;
S13, a convolutional neural network is constructed, the processed pictures are taken as input and the operation data of the corresponding pictures as label values, and the network is trained; the mean-squared-error loss is minimized with a Nadam-based optimizer to optimize the weight parameters of the neural network;
S14, the network structure and weights of the trained convolutional neural network are saved and used to establish a new convolutional neural network, completing the state feature extractor.
Further, the convolutional neural network established in step S13 comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers; the input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and finally to the first, second, third and fourth fully connected layers in sequence.
Further, the trained convolutional neural network saved in step S14 does not include the output layer.
Further, the reward function generator obtains the reward function used to construct the driving strategy as follows:
S21, driving demonstration data of an expert are acquired: the driving demonstration data are obtained by sampling the demonstration driving video, a continuous segment of driving video being sampled at a certain frequency to give one trajectory demonstration; one set of expert demonstration data contains several trajectories and is denoted as a whole by
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = ∑_{i=1}^{N_T} L_i,
where D_E is the driving demonstration data as a whole, (s_j, a_j) is the data pair formed by a state j and the decision command corresponding to that state, M is the total number of driving demonstration data pairs, N_T is the number of driving demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th driving demonstration trajectory;
S22, the feature expectation of the driving demonstration is obtained;
first, each state s_t describing a driving environment condition in the driving demonstration data D_E is input into the state feature extractor to obtain the corresponding feature vector f(s_t, a_t), i.e. the set of features describing s_t; the feature expectation of the driving demonstration is then calculated as
μ_E = (1/N_T) ∑_{i=1}^{N_T} ∑_t γ^t f(s_t, a_t),
where γ is a discount factor set according to the problem at hand;
S23, the state-action set under the greedy strategy is obtained;
S24, the weights of the reward function are obtained.
Further, the specific steps for obtaining the state-action set under the greedy strategy are as follows. The reward function generator and the driving strategy acquirer are two parts of one loop. First, the neural network in the driving strategy acquirer is obtained; the state features f(s_t, a_t) describing the environmental conditions, extracted from the driving demonstration data D_E, are input into this neural network to obtain the output g_w(s_t). g_w(s_t) is the set of Q values for the described state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where the state-action value Q(s_t, a_i) describes how good it is to select the decision driving action a_i in the current driving state s_t and is obtained from the formula Q(s, a) = θ · μ(s, a), in which θ is the weight of the current reward function and μ(s, a) is the feature expectation value.
Then, based on an ε-greedy strategy, the driving decision action a_t corresponding to the described driving scene state s_t is selected: with probability ε, the decision action a_t = argmax_{a_i} Q(s_t, a_i) that maximizes the Q value in the Q-value set of the current scene s_t is selected; otherwise a_t is selected at random. After a_t has been selected, the value Q(s_t, a_t) is recorded. In this way, by inputting the state feature f(s_t, a_t) of every state in the driving demonstration D_E into the neural network, M state-action pairs (s_t, a_t) are obtained, each describing the driving decision action a_t selected in the driving scene state s_t at time t; at the same time, based on the actions selected, the Q values of the M corresponding state-action pairs are recorded and denoted Q.
Further, the specific steps for obtaining the weights of the reward function are as follows.
First, an objective function is constructed of the form
J(θ) = (1/M) ∑_{t=1}^{M} [ Q(s_t, a_t) + L(s_t, a_t) ] − θ^T μ_E + (λ/2)‖θ‖²,
where L(s_t, a_t) is a loss term that is 0 if the current state-action pair exists in the driving demonstration and 1 otherwise; Q(s_t, a_t) are the corresponding state-action values recorded above; θ^T μ_E is the product of the driving demonstration feature expectation obtained in step S22 and the reward function weight θ; and (λ/2)‖θ‖² is a regular term.
The objective function is then minimized by gradient descent, i.e. t = min_θ J(θ), giving the variable θ that minimizes the objective; this θ is the weight of the required reward function.
Further, the process by which the reward function generator obtains the reward function further comprises: S25, based on the obtained reward function weight θ, the reward function generator is constructed according to the formula r(s, a) = θ^T f(s, a).
As a further step, the driving strategy acquirer completes the construction of the driving strategy as follows:
S31, constructing the training data of the driving strategy acquirer
Training data are acquired, each item comprising two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor; the other is a target value ŷ_t obtained from a formula involving r_θ(s_t, a_t), Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), where r_θ(s_t, a_t) is the reward generated by the reward function generator from the driving demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) are selected from the Q values recorded in S23 as the values describing the driving scene s_t at time t and the driving scene s_{t+1} at time t+1, respectively;
s32, establishing a neural network
The neural network comprises three layers, wherein the first layer is used as an input layer, the number of neurons and the output feature types of the feature extractor are the same and are k, and the neural network is used for inputting the features f(s) of the driving scenet,at) The number of the hidden layers of the second layer is 10, and the third layerThe number of neurons in the layer is the same as the number n of driving actions for decision making in the action space; the activation functions of the input layer and the hidden layer are sigmoid functions, i.e.
Figure BDA0001706988370000058
Namely, the method comprises the following steps:
z=w(1)x=w(1)[1,ft]T
h=sigmoid(z)
gw(st)=sigmoid(w(2)[1,h]T)
wherein w(1)The weight value of the hidden layer; f. oftFor the state s of the driving scene at time ttI.e. the input of the neural network; z is the network layer output when the hidden layer sigmoid activation function is not passed; h is hidden layer output after the sigmoid activation function; w is a(2)Is the weight of the output layer;
g of network outputw(st) Is the driving scene state s at time ttQ set of (1), i.e., [ Q(s) ]t,a1),...,Q(st,an)]TQ in S31π(st,at) That is, the state stInput neural network, selecting a in the outputtThe term is obtained;
s33, optimizing the neural network
For the optimization of the neural network, the established loss function is a cross entropy cost function, and the formula is as follows:
Figure BDA0001706988370000061
wherein N represents the number of training data; qπ(st,at) Will describe the driving scene state s at time ttInputting the neural network, selecting the corresponding driving decision action a in the outputtThe value obtained by the term;
Figure BDA0001706988370000062
the numerical value obtained in S31;
Figure BDA0001706988370000063
is a regular term where W ═ W(1),w(2)The weight in the neural network is represented by the symbol;
inputting the training data obtained in the S31 into the neural network optimization cost function; and (4) finishing the minimization of the cross entropy cost function by means of a gradient descent method to obtain an optimized neural network, and further obtaining a driving strategy acquirer.
As a further step, the judger is implemented as follows:
the current reward function generator and driving strategy acquirer are considered as a whole, and the current value of t (obtained in S24) is checked against t < ε, where ε is a threshold for judging whether the objective function, i.e. the reward function currently used to obtain the driving strategy, meets the requirement; its value is set according to the specific needs;
when the value of t does not satisfy this inequality, the reward function generator has to be reconstructed, and the neural network used in S23 is replaced by the new neural network optimized in S33, i.e. the network used to produce the values Q(s_t, a_i) describing how good the selected decision driving action a_i is in the driving scene state s_t is replaced by the new network structure optimized by gradient descent in S33; the reward function generator is then reconstructed, the driving strategy acquirer is obtained again, and the value of t is checked again;
when the inequality is satisfied, the current θ is the weight of the required reward function, the reward function generator meets the requirement, and the driving strategy acquirer also meets the requirement. Then the driving data of the driver for whom a model is to be established are collected, namely the environmental scene images and the corresponding operation data during driving; the driving environment scene images are input into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are input into the reward function generator to obtain the reward function corresponding to the scene state; and the collected decision features together with the computed reward function are input into the driving strategy acquirer to obtain the driving strategy corresponding to that driver.
Compared with the prior art, the invention has the following beneficial effects. In this method of describing driver decisions and establishing a driver behavior model, the strategy is represented by a neural network; once the network parameters are determined, states are mapped to actions, so the possible state-action pairs are not limited to those in the demonstration trajectories. In real driving, the state space produced by weather, scenery and other factors is very large, but thanks to the ability of a neural network to approximate arbitrary functions, the strategy representation can be treated as a black box: the feature values of a state are input, the corresponding state-action values are output, and an action is selected according to those outputs. The applicability of driver behavior modeling based on inverse reinforcement learning is thereby greatly enhanced. Traditional methods try to fit the demonstration trajectories with some probability distribution, so the resulting optimal strategy remains limited to the states that already exist in the demonstration trajectories, whereas the present method can be applied to new scene states and obtain the corresponding actions, so the generalization capability of the established driver behavior model is greatly improved, the applicable scenes are wider, and the robustness is stronger.
Drawings
FIG. 1 is a new deep convolutional neural network;
FIG. 2 is a driving video sampling diagram;
FIG. 3 is a block diagram of the system workflow;
FIG. 4 is a diagram illustrating the neural network structure established in step S32.
Detailed Description
The invention will be further explained with reference to the drawings attached to the specification. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The present embodiment provides a driver behavior modeling system including:
1. the feature extractor extracts and constructs return function features, and the specific mode is as follows:
s11, sampling a driving video obtained by a camera placed behind a windshield of a vehicle in the driving process of the vehicle, wherein a sampling graph is shown in figure 2.
N groups of pictures of road conditions of different vehicle driving road environments and corresponding steering angle conditions are obtained. The training data are jointly constructed by corresponding driving operation data, wherein the training data comprise N1 straight roads and N2 curved roads, the values of N1 and N2 can be N1> -300 and N2> -3000.
S12, the collected pictures are translated, cropped, brightness-adjusted and similarly processed to simulate scenes with different illumination and weather.
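As an illustration of the augmentation step just described, the following minimal Python sketch applies random translation, cropping and brightness changes to a picture, assuming each picture is stored as an H×W×3 numpy array; the shift ranges and brightness factors are illustrative assumptions, not values specified in this embodiment.

```python
import numpy as np

def augment(img, rng):
    """Randomly translate, crop and re-brighten one image (H x W x 3, uint8)."""
    h, w, _ = img.shape
    # horizontal/vertical translation by up to ~5% of the image size
    dx = rng.integers(-w // 20, w // 20 + 1)
    dy = rng.integers(-h // 20, h // 20 + 1)
    shifted = np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    # random crop back to a fixed region to simulate small viewpoint changes
    top, left = rng.integers(0, h // 10 + 1), rng.integers(0, w // 10 + 1)
    cropped = shifted[top:top + int(0.9 * h), left:left + int(0.9 * w)]
    # brightness change to mimic different illumination / weather
    factor = rng.uniform(0.6, 1.4)
    return np.clip(cropped.astype(np.float32) * factor, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(120, 160, 3), dtype=np.uint8)
augmented = augment(sample, rng)
```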
S13, a convolutional neural network is constructed, the processed pictures are taken as input and the operation data of the corresponding pictures as label values, and the network is trained; the mean-squared-error loss is minimized with a Nadam-based optimizer to optimize the weight parameters of the neural network.
The convolutional neural network comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers. The input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and then to the first, second, third and fourth fully connected layers in sequence.
S14, the network structure and the weights of the trained convolutional neural network, except for the final output layer, are saved and used to establish a new convolutional neural network, completing the state feature extractor.
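A possible sketch, in PyTorch, of the network and training procedure described in S13-S14 is given below. Only the layer counts (3 convolution + 3 pooling, 4 fully connected), the mean-squared-error loss and the Nadam optimizer come from this embodiment; the channel counts, kernel sizes, ReLU activations and the 66×200 input resolution are assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class SteeringCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(                    # 3 conv + 3 pooling layers
            nn.Conv2d(3, 24, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(24, 36, 5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(36, 48, 3), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.fc = nn.Sequential(                          # 4 fully connected layers
            nn.LazyLinear(100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 10), nn.ReLU(),
            nn.Linear(10, 1),                             # steering-angle output (dropped later)
        )

    def forward(self, x):
        return self.fc(self.features(x))

model = SteeringCNN()
_ = model(torch.zeros(1, 3, 66, 200))        # dry run to materialize the LazyLinear layer
optimizer = torch.optim.NAdam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

images = torch.randn(8, 3, 66, 200)          # stand-in for the augmented road pictures
angles = torch.randn(8, 1)                   # stand-in for the recorded steering angles
optimizer.zero_grad()
loss = loss_fn(model(images), angles)
loss.backward()
optimizer.step()

# S14: keep everything except the final output layer as the state feature extractor
feature_extractor = nn.Sequential(model.features, model.fc[:-1])
state_features = feature_extractor(images)   # one feature vector per driving scene
```

Dropping the final fully connected layer leaves a network whose output serves as the state feature f(s_t) used in the later steps.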
2. The reward function generator obtains the reward function used to construct the driving strategy in the following way.
The reward function serves as the criterion for action selection in reinforcement learning, and its quality is decisive for the process of acquiring the driving strategy: it directly determines the quality of the acquired driving strategy and whether that strategy matches the strategy underlying the real driving demonstration data. The reward function has the form reward = θ^T f(s_t, a_t), where f(s_t, a_t) denotes the feature values that influence the driving decision in the state s_t at time t of the scene corresponding to the driving environment, i.e. the surroundings of the vehicle, and is used to describe the scene around the vehicle. θ denotes the set of weights of these decision-influencing features; its values indicate the proportion, and hence the importance, of the corresponding environmental features in the reward function. On the basis of the state feature extractor, the weight θ must be solved for in order to construct the reward function that shapes the driving strategy.
S21, acquiring driving demonstration data of an expert
The driving demonstration data come from sampling the demonstration driving video data (distinct from the data used by the driving environment feature extractor above); a continuous segment of driving video may be sampled at a frequency of 10 Hz, giving one trajectory demonstration. One expert demonstration should contain several trajectories and is denoted as a whole by
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = ∑_{i=1}^{N_T} L_i,
where D_E is the driving demonstration data as a whole, (s_j, a_j) is the data pair formed by a state j (the video picture of the driving environment at sampling time j) and the decision command corresponding to that state (e.g. the steering angle of a steering command), M is the total number of driving demonstration data pairs, N_T is the number of driving demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th driving demonstration trajectory.
S22, obtaining the feature expectation of the driving demonstration
First, each state s_t describing a driving environment condition in the driving demonstration data D_E is input into the state feature extractor to obtain the corresponding feature vector f(s_t, a_t), i.e. the set of features describing s_t; the feature expectation of the driving demonstration is then calculated as
μ_E = (1/N_T) ∑_{i=1}^{N_T} ∑_t γ^t f(s_t, a_t),
where γ is a discount factor whose reference value may be set to 0.65, depending on the problem.
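The feature expectation μ_E of S22 is a discounted, per-trajectory-averaged sum of the extracted features; a small numpy sketch, assuming D_E is stored as a list of trajectories whose entries are the feature vectors f(s_t, a_t) already produced by the state feature extractor, and using the reference value γ = 0.65:

```python
import numpy as np

def demo_feature_expectation(trajectories, gamma=0.65):
    """mu_E: discounted sum of per-step features, averaged over the N_T demonstration trajectories."""
    n_t = len(trajectories)
    k = len(trajectories[0][0])
    mu = np.zeros(k)
    for traj in trajectories:                       # each traj is [f(s_0,a_0), f(s_1,a_1), ...]
        for t, feat in enumerate(traj):
            mu += (gamma ** t) * np.asarray(feat)
    return mu / n_t

# toy demonstration: 2 trajectories, 10-dimensional state features
rng = np.random.default_rng(1)
demo = [[rng.random(10) for _ in range(5)] for _ in range(2)]
mu_E = demo_feature_expectation(demo)
```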
S23, obtaining a state-action set under a greedy strategy
First, the neural network in the driving strategy acquirer of S32 is obtained. (Since the reward function generator and the driving strategy acquirer are two parts of one loop, at the beginning this is simply the freshly initialized neural network of S32. As the loop proceeds, each pass constructs a reward function that influences the driving decisions, acquires the corresponding optimal driving strategy in the driving strategy acquirer based on the current reward function, and checks whether the criterion for ending the loop is met; if it is not, the neural network most recently optimized in S33 is used in the reconstruction of the reward function.)
The state features f(s_t, a_t) describing the environmental conditions, extracted from the driving demonstration data D_E, are input into the neural network to obtain the output g_w(s_t). g_w(s_t) is the set of Q values for the described state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where the state-action value Q(s_t, a_i) describes how good it is to select the decision driving action a_i in the current driving state s_t and can be obtained from the formula Q(s, a) = θ · μ(s, a), in which θ is the weight of the current reward function and μ(s, a) is the feature expectation.
Then, based on an ε-greedy strategy with ε set to 0.5, the driving decision action a_t corresponding to the described driving scene state s_t is selected: with a fifty percent likelihood, the decision action a_t = argmax_{a_i} Q(s_t, a_i) that maximizes the Q value in the Q-value set of the current driving scene s_t is picked; otherwise a_t is selected at random. After a_t has been selected, the value Q(s_t, a_t) is recorded. In this way, by inputting the state feature f(s_t, a_t) of every state in the driving demonstration D_E into the neural network, M state-action pairs (s_t, a_t) are obtained, each describing the driving decision action a_t selected in the driving scene state s_t at time t. At the same time, based on the actions selected, the Q values of the M corresponding state-action pairs are recorded and denoted Q.
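The action selection of S23 can be illustrated as follows; the sketch assumes that the per-action Q values of a state are computed as Q(s, a_i) = θ · μ(s, a_i) from per-action feature expectations (random stand-ins here) and uses ε = 0.5 as in this embodiment.

```python
import numpy as np

def epsilon_greedy(theta, mu_per_action, epsilon=0.5, rng=None):
    """Return (chosen action index, its Q value) for one driving scene state."""
    rng = rng or np.random.default_rng()
    q_values = mu_per_action @ theta              # Q(s, a_i) = theta . mu(s, a_i) for each action
    if rng.random() < epsilon:
        action = int(np.argmax(q_values))         # pick the action maximizing the Q value
    else:
        action = int(rng.integers(len(q_values))) # otherwise pick an action at random
    return action, q_values[action]

rng = np.random.default_rng(2)
theta = rng.random(10)                            # current reward-function weights
mu_per_action = rng.random((4, 10))               # feature expectations for n = 4 candidate actions
a_t, q_t = epsilon_greedy(theta, mu_per_action, rng=rng)   # recorded pair (a_t, Q(s_t, a_t))
```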
S24, obtaining the weights of the reward function
First, an objective function is constructed of the form
J(θ) = (1/M) ∑_{t=1}^{M} [ Q(s_t, a_t) + L(s_t, a_t) ] − θ^T μ_E + (λ/2)‖θ‖²,
where L(s_t, a_t) is a loss term based on whether the current state-action pair exists in the driving demonstration: 0 if it exists and 1 if it does not; Q(s_t, a_t) are the corresponding state-action values recorded above; θ^T μ_E is the product of the driving demonstration feature expectation obtained in S22 and the reward function weight θ; and (λ/2)‖θ‖² is a regular term introduced to prevent overfitting, whose coefficient may be set to 0.9.
The objective function is minimized by gradient descent, i.e. t = min_θ J(θ), giving the variable θ that minimizes the objective; this θ is the weight of the required reward function.
S25, based on the obtained reward function weight θ, the reward function generator is constructed according to the formula r(s, a) = θ^T f(s, a).
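With θ in hand, the reward function generator of S25 is simply the linear form r(s, a) = θ^T f(s, a); a minimal sketch:

```python
import numpy as np

def make_reward_function(theta):
    """Build r(s, a) = theta^T f(s, a) from the learned reward weights."""
    theta = np.asarray(theta)
    def reward(f_sa):                  # f_sa: feature vector f(s_t, a_t) from the state feature extractor
        return float(theta @ np.asarray(f_sa))
    return reward

theta = np.array([0.3, -0.1, 0.5])
r = make_reward_function(theta)
print(r([1.0, 2.0, 0.5]))              # reward for one state-action feature vector
```

Because the reward is linear in the extracted features, replacing θ after each pass of the loop immediately updates the whole generator.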
3. The driving strategy acquirer completes the construction of the driving strategy in the following way.
S31, constructing the training data of the driving strategy acquirer
Training data are acquired. They come from sampling the earlier demonstration data, but are processed into a set of new-type data items, N in total. Each item comprises two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor; the other is a target value ŷ_t obtained from a formula involving r_θ(s_t, a_t), Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), where r_θ(s_t, a_t) is the reward generated by the reward function generator from the driving demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) are selected from the set Q of Q values recorded in S23 as the values describing the driving scene s_t at time t and the driving scene s_{t+1} at time t+1, respectively.
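The exact target formula of S31 appears only as a drawing in the original document. Purely for illustration, the sketch below assumes a temporal-difference-style target r_θ(s_t, a_t) + γ·Q^π(s_{t+1}, a_{t+1}) built from the quantities the text says enter the formula, and reuses γ = 0.65 from S22; this assumed form is not asserted to be the patent's exact expression.

```python
import numpy as np

def build_training_pairs(features, rewards, q_next, gamma=0.65):
    """Pair each decision feature f(s_t) with an assumed TD-style target r_theta + gamma * Q(s_{t+1}, a_{t+1}).

    Note: the patent's formula also involves Q(s_t, a_t); that dependence is omitted in this assumed form."""
    targets = np.asarray(rewards) + gamma * np.asarray(q_next)
    return list(zip(features, targets))

rng = np.random.default_rng(3)
feats = [rng.random(10) for _ in range(4)]   # driving decision features f(s_t)
r_vals = rng.random(4)                       # r_theta(s_t, a_t) from the reward function generator
q_next = rng.random(4)                       # recorded Q_pi(s_{t+1}, a_{t+1}) values
training_data = build_training_pairs(feats, r_vals, q_next)
```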
S32, establishing a neural network
The neural network comprises three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene features f(s_t, a_t). The second, hidden layer has 10 neurons, and the number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation functions of the input layer and the hidden layer are sigmoid functions, i.e. sigmoid(x) = 1/(1 + e^(−x)), so that
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) denotes the weight of the hidden layer; f_t denotes the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z denotes the output of the network layer before the hidden-layer sigmoid activation; h denotes the hidden-layer output after the sigmoid activation; and w^(2) denotes the weight of the output layer. The network structure is shown in FIG. 4.
The network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; Q^π(s_t, a_t) in S31 is obtained by inputting the state s_t into the neural network and selecting the a_t entry of the output.
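The forward pass of this three-layer network follows directly from the equations above, with the leading 1 in [1, f_t] acting as a bias term; a numpy sketch with k = 10 input features, 10 hidden neurons and n = 4 actions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(f_t, w1, w2):
    """g_w(s_t): the vector [Q(s_t, a_1), ..., Q(s_t, a_n)] for one driving scene state."""
    x = np.concatenate(([1.0], f_t))                   # input with bias term: [1, f_t]
    z = w1 @ x                                         # hidden pre-activation: z = w^(1) [1, f_t]^T
    h = sigmoid(z)                                     # hidden output
    return sigmoid(w2 @ np.concatenate(([1.0], h)))    # g_w(s_t) = sigmoid(w^(2) [1, h]^T)

k, hidden, n = 10, 10, 4
rng = np.random.default_rng(4)
w1 = rng.normal(scale=0.1, size=(hidden, k + 1))
w2 = rng.normal(scale=0.1, size=(n, hidden + 1))
q_set = forward(rng.random(k), w1, w2)                 # Q values for the n candidate driving actions
```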
S33, optimizing the neural network
The loss established for optimizing the neural network is a cross-entropy cost function of the form
L(W) = −(1/N) ∑_{t=1}^{N} [ ŷ_t log Q^π(s_t, a_t) + (1 − ŷ_t) log(1 − Q^π(s_t, a_t)) ] + (λ/2)‖W‖²,
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by inputting the driving scene state s_t at time t into the neural network and selecting the corresponding driving decision action a_t in the output; ŷ_t is the target value obtained in S31; and (λ/2)‖W‖² is a regular term, again set to prevent overfitting, whose coefficient may be 0.9, with W = {w^(1), w^(2)} denoting the weights in the neural network.
The training data obtained in S31 are input into the neural network to optimize this cost function; the cross-entropy cost is minimized by gradient descent, giving the optimized neural network and thus the driving strategy acquirer.
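Since the original cost formula is again a drawing, the sketch below assumes the standard binary cross-entropy form between the network outputs Q^π(s_t, a_t) and the S31 targets, plus a quadratic regular term with coefficient λ = 0.9 as suggested above; it only evaluates the cost, the gradient-descent update of w^(1) and w^(2) being left to any autodiff tool or hand-derived gradients.

```python
import numpy as np

def cross_entropy_cost(q_pred, y_target, weights, lam=0.9):
    """Assumed form: binary cross entropy between predictions and targets plus (lam/2)*||W||^2."""
    q_pred = np.clip(np.asarray(q_pred), 1e-7, 1 - 1e-7)   # keep log() finite
    y = np.clip(np.asarray(y_target), 0.0, 1.0)
    ce = -np.mean(y * np.log(q_pred) + (1 - y) * np.log(1 - q_pred))
    reg = 0.5 * lam * sum(np.sum(w ** 2) for w in weights)
    return ce + reg

q_pred = np.array([0.6, 0.2, 0.8])       # Q_pi(s_t, a_t) read off the network output
y_target = np.array([0.7, 0.1, 0.9])     # target values built in S31
weights = [np.ones((10, 11)) * 0.01, np.ones((4, 11)) * 0.01]   # w^(1), w^(2)
print(cross_entropy_cost(q_pred, y_target, weights))
```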
4. The judger checks whether the optimal driving strategy constructed by the acquirer meets the judgment criterion; if it does not, the reward function is reconstructed and the optimal driving strategy is constructed again, iterating until the criterion is met; finally, a driving strategy describing the real driving demonstration is obtained.
The current reward function generator and driving strategy acquirer are considered as a whole, and the current value of t (obtained in S24) is checked against t < ε, where ε is a threshold for judging whether the objective function, i.e. the reward function currently used to obtain the driving strategy, meets the requirement; its value is set according to the specific needs.
When the value of t does not satisfy this inequality, the reward function generator has to be reconstructed, and the neural network used in S23 is replaced by the new neural network optimized in S33, i.e. the network used to produce the values Q(s_t, a_i) describing how good the selected decision driving action a_i is in the driving scene state s_t is replaced by the new network structure optimized by gradient descent in S33. The reward function generator is then reconstructed, the driving strategy acquirer is obtained again, and the value of t is checked again.
When the inequality is satisfied, the current θ is the weight of the required reward function; the reward function generator meets the requirement and so does the driving strategy acquirer. The driving data of the particular driver for whom a model is to be established can then be collected, namely the environmental scene images and the corresponding operation data during driving, such as the steering angle. The images are input into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are input into the reward function generator to obtain the reward function corresponding to the scene state; and the collected decision features together with the computed reward function are input into the driving strategy acquirer to obtain the driving strategy corresponding to that driver.
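The alternation enforced by the judger can be summarized as the following loop skeleton; the function names are placeholders for the components described above, not an API defined by this embodiment.

```python
def run_modeling_loop(build_reward_generator, fit_strategy_network,
                      epsilon_threshold, max_iterations=50):
    """Alternate reward-function construction (S21-S25) and strategy optimization (S31-S33)
    until the judger's criterion t < epsilon is met."""
    strategy_net, theta, t = None, None, float("inf")
    for _ in range(max_iterations):
        theta, t = build_reward_generator(strategy_net)   # returns reward weights and objective value t
        strategy_net = fit_strategy_network(theta)        # optimized policy network for this reward
        if t < epsilon_threshold:                         # judger: current reward function is acceptable
            break
    return theta, strategy_net

# toy stand-ins just to show the control flow
theta, net = run_modeling_loop(
    build_reward_generator=lambda net: ([0.1, 0.2], 0.01),
    fit_strategy_network=lambda theta: "policy-net",
    epsilon_threshold=0.05,
)
```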
In a Markov decision process, a strategy must connect each state to its corresponding action. When the state space is large, however, it is difficult to describe the strategy for regions that have not been traversed. Traditional methods simply omit such a description: they characterize the probability model of the whole trajectory distribution from the demonstration trajectories and give no concrete strategy representation for a new state, i.e. no concrete way to assess the probability of taking a given action in that state. In the present invention the strategy is described by a neural network, which can approximate any function to arbitrary accuracy and therefore has excellent generalization capability. Through the state-feature representation, states that are not contained in the demonstration trajectories can still be represented; by inputting the corresponding state features into the neural network, the corresponding action values are obtained and the action is selected according to the strategy. This solves the problem that traditional methods cannot generalize from the driving demonstration data to driving scene states that have not been traversed.
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto; any person skilled in the art could make substitutions or changes to the technical solution and the inventive concept of the present invention within the technical scope disclosed herein.

Claims (8)

1. A driver behavior modeling system, characterized in that it specifically comprises:
a feature extractor, which extracts and constructs the features of the reward function;
a reward function generator, which obtains the reward function required to construct the driving strategy;
a driving strategy acquirer, which completes the construction of the driving strategy; and
a judger, which checks whether the optimal driving strategy constructed by the acquirer meets the judgment criterion; if it does not, the reward function is reconstructed and the optimal driving strategy is constructed again, iterating until the criterion is met;
wherein the feature extractor extracts and constructs the reward function features as follows:
S11, during driving, the video from a camera mounted behind the windshield of the vehicle is sampled to obtain N groups of pictures of road conditions in different driving environments and the corresponding steering angles; together with the corresponding driving operation data, the training data are jointly constructed;
S12, the collected pictures are translated, cropped and brightness-adjusted to simulate scenes with different illumination and weather;
S13, a convolutional neural network is constructed, the processed pictures are taken as input and the operation data of the corresponding pictures as label values, and the network is trained; the mean-squared-error loss is minimized with a Nadam-based optimizer to optimize the weight parameters of the neural network;
S14, the network structure and weights of the trained convolutional neural network are saved and used to establish a new convolutional neural network, completing the state feature extractor;
wherein the reward function generator obtains the reward function used to construct the driving strategy as follows:
S21, driving demonstration data of an expert are acquired: the driving demonstration data are obtained by sampling the demonstration driving video, a continuous segment of driving video being sampled at a certain frequency to give one trajectory demonstration; one set of expert demonstration data contains several trajectories and is denoted as a whole by
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = ∑_{i=1}^{N_T} L_i,
where D_E is the driving demonstration data as a whole, (s_j, a_j) is the data pair formed by a state j and the decision command corresponding to that state, M is the total number of driving demonstration data pairs, N_T is the number of driving demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th driving demonstration trajectory;
S22, the feature expectation of the driving demonstration is obtained;
first, each state s_t describing a driving environment condition in the driving demonstration data D_E is input into the state feature extractor to obtain the corresponding feature vector f(s_t, a_t), i.e. the set of features describing s_t; the feature expectation of the driving demonstration is then calculated as
μ_E = (1/N_T) ∑_{i=1}^{N_T} ∑_t γ^t f(s_t, a_t),
where γ is a discount factor set according to the problem at hand;
S23, the state-action set under the greedy strategy is obtained;
S24, the weights of the reward function are obtained.
2. The driver behavior modeling system of claim 1, wherein the convolutional neural network established in step S13 comprises 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers; the input layer is connected in sequence to the first convolutional layer and the first pooling layer, then to the second convolutional layer and the second pooling layer, then to the third convolutional layer and the third pooling layer, and finally to the first, second, third and fourth fully connected layers in sequence.
3. The driver behavior modeling system of claim 1, wherein the trained convolutional neural network of step S14 does not include an output layer.
4. The driver behavior modeling system of claim 1, wherein the specific steps for obtaining the state-action set under the greedy strategy are: the reward function generator and the driving strategy acquirer are two parts of one loop; first, the neural network in the driving strategy acquirer is obtained; the state features f(s_t, a_t) describing the environmental conditions, extracted from the driving demonstration data D_E, are input into the neural network to obtain the output g_w(s_t); g_w(s_t) is the set of Q values for the described state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where the state-action value Q(s_t, a_i) describes how good it is to select the decision driving action a_i in the current driving state s_t and is obtained from the formula Q(s, a) = θ · μ(s, a), in which θ is the weight of the current reward function and μ(s, a) is the feature expectation value;
then, based on an ε-greedy strategy, the driving decision action a_t corresponding to the described driving scene state s_t is selected: with probability ε, the decision action a_t = argmax_{a_i} Q(s_t, a_i) that maximizes the Q value in the Q-value set of the current scene s_t is selected, and otherwise a_t is selected at random; after a_t has been selected, the value Q(s_t, a_t) is recorded; in this way, by inputting the state feature f(s_t, a_t) of every state in the driving demonstration D_E into the neural network, M state-action pairs (s_t, a_t) are obtained, each describing the driving decision action a_t selected in the driving scene state s_t at time t; at the same time, based on the actions selected, the Q values of the M corresponding state-action pairs are recorded and denoted Q.
5. The driver behavior modeling system of claim 1, wherein the specific steps for obtaining the weights of the reward function are:
first, an objective function is constructed of the form
J(θ) = (1/M) ∑_{t=1}^{M} [ Q(s_t, a_t) + L(s_t, a_t) ] − θ^T μ_E + (λ/2)‖θ‖²,
where L(s_t, a_t) is a loss term that is 0 if the current state-action pair exists in the driving demonstration and 1 otherwise; Q(s_t, a_t) are the corresponding state-action values recorded above; θ^T μ_E is the product of the driving demonstration feature expectation obtained in step S22 and the reward function weight θ; and (λ/2)‖θ‖² is a regular term;
the objective function is minimized by gradient descent, i.e. t = min_θ J(θ), giving the variable θ that minimizes the objective; this θ is the weight of the required reward function.
6. The driver behavior modeling system of claim 1, wherein the process by which the reward function generator obtains the reward function further comprises: S25, based on the obtained reward function weight θ, the reward function generator is constructed according to the formula r(s, a) = θ^T f(s, a).
7. The driver behavior modeling system of claim 1, wherein the driving strategy acquirer completes the construction of the driving strategy as follows:
S31, constructing the training data of the driving strategy acquirer
training data are acquired, each item comprising two parts: one is the driving decision feature f(s_t) obtained by inputting the driving scene state at time t into the driving state extractor; the other is a target value ŷ_t obtained from a formula involving r_θ(s_t, a_t), Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), where r_θ(s_t, a_t) is the reward generated by the reward function generator from the driving demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) are selected from the Q values recorded in S23 as the values describing the driving scene s_t at time t and the driving scene s_{t+1} at time t+1, respectively;
S32, establishing a neural network
the neural network comprises three layers; the first layer is the input layer, whose number of neurons equals the number k of feature types output by the feature extractor and which receives the driving scene features f(s_t, a_t); the second, hidden layer has 10 neurons, and the number of neurons in the third layer equals the number n of decision driving actions in the action space; the activation functions of the input layer and the hidden layer are sigmoid functions, i.e. sigmoid(x) = 1/(1 + e^(−x)), so that
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) is the weight of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z is the output of the network layer before the hidden-layer sigmoid activation; h is the hidden-layer output after the sigmoid activation; and w^(2) is the weight of the output layer;
the network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; Q^π(s_t, a_t) in S31 is obtained by inputting the state s_t into the neural network and selecting the a_t entry of the output;
S33, optimizing the neural network
the loss established for optimizing the neural network is a cross-entropy cost function of the form
L(W) = −(1/N) ∑_{t=1}^{N} [ ŷ_t log Q^π(s_t, a_t) + (1 − ŷ_t) log(1 − Q^π(s_t, a_t)) ] + (λ/2)‖W‖²,
where N is the number of training data; Q^π(s_t, a_t) is the value obtained by inputting the driving scene state s_t at time t into the neural network and selecting the corresponding driving decision action a_t in the output; ŷ_t is the target value obtained in S31; and (λ/2)‖W‖² is a regular term, with W = {w^(1), w^(2)} denoting the weights in the neural network;
the training data obtained in S31 are input into the neural network to optimize this cost function; the cross-entropy cost is minimized by gradient descent, giving the optimized neural network and thus the driving strategy acquirer.
8. The driver behavior modeling system of claim 5, wherein the judger is implemented as follows:
the current reward function generator and driving strategy acquirer are considered as a whole, and the current value of t is checked against t < ε, where ε is a threshold for judging whether the objective function, i.e. the reward function currently used to obtain the driving strategy, meets the requirement, its value being set according to the specific needs;
when the value of t does not satisfy this inequality, the reward function generator has to be reconstructed, and the neural network used in S23 is replaced by the new neural network optimized in S33, i.e. the network used to produce the values Q(s_t, a_i) describing how good the selected decision driving action a_i is in the driving scene state s_t is replaced by the new network structure optimized by gradient descent in S33; the reward function generator is then reconstructed, the driving strategy acquirer is obtained again, and the value of t is checked again;
when the inequality is satisfied, the current θ is the weight of the required reward function; the reward function generator meets the requirement and so does the driving strategy acquirer; then the driving data of the particular driver for whom a model is to be established are collected, namely the environmental scene images and the corresponding operation data during driving; the driving environment scene images are input into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are input into the reward function generator to obtain the reward function corresponding to the scene state; and the collected decision features together with the computed reward function are input into the driving strategy acquirer to obtain the driving strategy corresponding to that driver.
CN201810662040.0A 2018-06-25 2018-06-25 Driver behavior modeling system Active CN108791302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810662040.0A CN108791302B (en) 2018-06-25 2018-06-25 Driver behavior modeling system

Publications (2)

Publication Number Publication Date
CN108791302A CN108791302A (en) 2018-11-13
CN108791302B true CN108791302B (en) 2020-05-19

Family

ID=64070795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810662040.0A Active CN108791302B (en) 2018-06-25 2018-06-25 Driver behavior modeling system

Country Status (1)

Country Link
CN (1) CN108791302B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200363800A1 (en) * 2019-05-13 2020-11-19 Great Wall Motor Company Limited Decision Making Methods and Systems for Automated Vehicle
CN110481561B (en) * 2019-08-06 2021-04-27 北京三快在线科技有限公司 Method and device for generating automatic control signal of unmanned vehicle
CN111079533B (en) * 2019-11-14 2023-04-07 深圳大学 Unmanned vehicle driving decision method, unmanned vehicle driving decision device and unmanned vehicle
CN112052776B (en) * 2020-09-01 2021-09-10 中国人民解放军国防科技大学 Unmanned vehicle autonomous driving behavior optimization method and device and computer equipment
CN112373482B (en) * 2020-11-23 2021-11-05 浙江天行健智能科技有限公司 Driving habit modeling method based on driving simulator
WO2022221979A1 (en) * 2021-04-19 2022-10-27 华为技术有限公司 Automated driving scenario generation method, apparatus, and system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103381826A (en) * 2013-07-31 2013-11-06 中国人民解放军国防科学技术大学 Adaptive cruise control method based on approximate policy iteration
CN105955930A (en) * 2016-05-06 2016-09-21 天津科技大学 Guidance-type policy search reinforcement learning algorithm
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN107229973A (en) * 2017-05-12 2017-10-03 中国科学院深圳先进技术研究院 The generation method and device of a kind of tactful network model for Vehicular automatic driving
CN107203134A (en) * 2017-06-02 2017-09-26 浙江零跑科技有限公司 A kind of front truck follower method based on depth convolutional neural networks
CN107480726A (en) * 2017-08-25 2017-12-15 电子科技大学 A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Autonomous navigation performance evaluation method based on trajectory analysis; 王勇鑫, 钱徽, 金卓军, 朱淼良; Computer Engineering (《计算机工程》); 2011-03-20; p. 142 *

Also Published As

Publication number Publication date
CN108791302A (en) 2018-11-13

Similar Documents

Publication Publication Date Title
CN108819948B (en) Driver behavior modeling method based on reverse reinforcement learning
CN108791302B (en) Driver behavior modeling system
CN108920805B (en) Driver behavior modeling system with state feature extraction function
US11062617B2 (en) Training system for autonomous driving control policy
CN110874578B (en) Unmanned aerial vehicle visual angle vehicle recognition tracking method based on reinforcement learning
CN109131348B (en) Intelligent vehicle driving decision method based on generative countermeasure network
CN110991027A (en) Robot simulation learning method based on virtual scene training
CN108891421B (en) Method for constructing driving strategy
CN110281949B (en) Unified hierarchical decision-making method for automatic driving
CN112550314B (en) Embedded optimization type control method suitable for unmanned driving, driving control module and automatic driving control system thereof
CN108944940B (en) Driver behavior modeling method based on neural network
CN114162146B (en) Driving strategy model training method and automatic driving control method
Farag Cloning safe driving behavior for self-driving cars using convolutional neural networks
Babiker et al. Convolutional neural network for a self-driving car in a virtual environment
CN113869170B (en) Pedestrian track prediction method based on graph division convolutional neural network
Farag Safe-driving cloning by deep learning for autonomous cars
CN115376103A (en) Pedestrian trajectory prediction method based on space-time diagram attention network
CN117406762A (en) Unmanned aerial vehicle remote control algorithm based on sectional reinforcement learning
CN117709602B (en) Urban intelligent vehicle personification decision-making method based on social value orientation
Zhong et al. Behavior prediction for unmanned driving based on dual fusions of feature and decision
Meftah et al. A virtual simulation environment using deep learning for autonomous vehicles obstacle avoidance
CN110222822A (en) The construction method of black box prediction model internal feature cause-and-effect diagram
CN117078923B (en) Automatic driving environment-oriented semantic segmentation automation method, system and medium
CN108791308B (en) System for constructing driving strategy based on driving environment
Oinar et al. Self-driving car steering angle prediction: Let transformer be a car again

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
OL01 Intention to license declared