CN108791302A - Driving behavior modeling - Google Patents
Driving behavior modeling
- Publication number
- CN108791302A CN201810662040.0A CN201810662040A
- Authority
- CN
- China
- Prior art keywords
- driving
- state
- neural network
- strategy
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000010276 construction Methods 0.000 claims abstract description 12
- 238000011156 evaluation Methods 0.000 claims abstract description 8
- 230000006870 function Effects 0.000 claims description 132
- 238000013528 artificial neural network Methods 0.000 claims description 58
- 238000000034 method Methods 0.000 claims description 49
- 230000009471 action Effects 0.000 claims description 35
- 238000012549 training Methods 0.000 claims description 20
- 230000008569 process Effects 0.000 claims description 19
- 238000013527 convolutional neural network Methods 0.000 claims description 15
- 238000011176 pooling Methods 0.000 claims description 12
- 238000005457 optimization Methods 0.000 claims description 10
- 230000004913 activation Effects 0.000 claims description 9
- 238000011478 gradient descent method Methods 0.000 claims description 9
- 238000005070 sampling Methods 0.000 claims description 7
- 210000002569 neuron Anatomy 0.000 claims description 6
- 239000000284 extract Substances 0.000 claims description 3
- 238000000605 extraction Methods 0.000 claims description 3
- 230000007613 environmental effect Effects 0.000 claims description 2
- 230000000875 corresponding effect Effects 0.000 abstract description 49
- 230000006399 behavior Effects 0.000 description 19
- 230000002787 reinforcement Effects 0.000 description 12
- 230000008859 change Effects 0.000 description 3
- 238000010586 diagram Methods 0.000 description 3
- 238000013461 design Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000002498 deadly effect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 230000006698 induction Effects 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W40/00—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
- B60W40/08—Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
- B60W2050/0028—Mathematical models, e.g. for simulation
- B60W2050/0029—Mathematical model of the driver
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Mechanical Engineering (AREA)
- Evolutionary Computation (AREA)
- Transportation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Automation & Control Theory (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Human Computer Interaction (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Evolutionary Biology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a driver behavior modeling system comprising: a feature extractor, which extracts the features used to construct the reward function; a reward function generator, which obtains the reward function required to construct a driving strategy; a driving strategy acquirer, which completes construction of the driving strategy; and a determiner, which judges whether the optimal driving strategy constructed by the acquirer satisfies the evaluation criterion. If it does not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is satisfied; the result is a driving strategy that describes the real driving demonstration. The system can be applied to new state scenarios to obtain their corresponding actions, which greatly improves the generalization ability of the resulting driver behavior model, making it applicable to a wider range of scenarios and more robust.
Description
Technical Field
The invention relates to a modeling method, and in particular to a driver behavior modeling system.
Background
Autonomous driving is an important part of intelligent transportation. Owing to the limits of current technology, autonomous vehicles still require an intelligent driving system (intelligent driver-assistance system) and a human driver to cooperate in completing driving tasks. In this process, driver modeling is an essential step, whether the goal is to better quantify driver information for the intelligent system's decision making or to provide personalized services by distinguishing between drivers.
Among current approaches to driver modeling, reinforcement learning handles well the complex sequential decision problem posed by vehicle driving, with its large-scale continuous space and multiple optimization objectives, and is therefore an effective way to model driver behavior. As an MDP-based problem-solving method, reinforcement learning must interact with the environment, taking actions to obtain evaluative feedback signals (rewards) from the environment and maximizing the long-term return.
A search of the existing literature shows that reward functions for driver behavior modeling are currently set in two main ways: the traditional approach, in which researchers manually set the reward for each scene state, and approaches based on inverse reinforcement learning. The traditional approach depends heavily on the researchers' subjectivity; the quality of the reward function is limited by their skill and experience. Moreover, setting the reward function correctly during driving requires balancing a large number of decision variables that are largely incommensurable and even contradictory, and researchers are often unable to design a reward function that balances all of these demands.
Inverse reinforcement learning, by contrast, uses driving demonstration data to assign suitable weights to the various driving features and can learn the required reward function automatically, overcoming the shortcomings of manual design. However, traditional inverse reinforcement learning can only learn from scene states that already appear in the demonstration data, whereas real driving scenes, owing to differences in weather, scenery and other factors, often go beyond the demonstrations. Inverse reinforcement learning methods therefore still show insufficient generalization of the relationship between scenes and decision actions learned from the demonstration data.
Existing driver behavior modeling methods based on reinforcement learning theory follow two main lines of thought. The first adopts traditional reinforcement learning: the reward function is set by researchers who analyze, organize, filter and summarize the scenes to obtain a series of features relevant to driving decisions, such as headway distance, distance from the curb, distance from pedestrians, reasonable speed and lane-change frequency; according to the requirements of the driving scenario, a series of experiments is then designed to determine the weights of these features in the reward function for the corresponding scene, completing the overall design of the reward function, which serves as the model describing the driver's behavior. The second is a probabilistic modeling approach that uses maximum-entropy inverse reinforcement learning to solve for the driving behavior feature function. It first assumes that there is an underlying, specific probability distribution that generated the demonstrated driving trajectories; the task is then to find a probability distribution that fits the driving demonstrations, and the problem of obtaining this distribution can be cast as the nonlinear program:
max_P  −∑ P log P
s.t.  ∑ P = 1
Here P is the probability distribution over demonstration trajectories. Once the distribution has been obtained by solving the above program and the relevant parameters have been estimated, the reward function r = θ^T f(s_t) can be obtained.
Traditional driver behavior models use known driving data to analyze, describe and reason about driving behavior. The collected driving data, however, cannot fully cover the endless variety of driving behavior, and it is even less feasible to record the corresponding action for every state. In real driving scenes the weather, scenery and objects vary so widely that the number of possible driving states is enormous, and traversing them all is impossible. Traditional driver behavior models therefore generalize poorly, rest on many modeling assumptions, and lack robustness.
Secondly, in practical driving problems, having researchers set the reward function by hand requires balancing too many demands on the various features; it relies entirely on the researchers' experience and on repeated manual tuning, which is time-consuming and labor-intensive and, worse still, highly subjective. Across different scenes and environments researchers face far too many scene states, and even for one fixed scene state, different requirements change the driving behavior characteristics. Accurately describing a driving task means assigning a whole series of weights to these factors. Among existing methods, inverse reinforcement learning based on a probability model starts from the available demonstration data, treats it as the given data, seeks the distribution underlying it, and only then derives the action to select in each state. But the distribution of the known data does not represent the distribution of all data; obtaining the correct distribution would require observing the corresponding action for every state.
Summary of the Invention
To address the weak generalization of driver modeling, namely the technical problem in the prior art that no corresponding reward function can be established for driver behavior modeling when a driving scene does not appear in the demonstration data, this application provides a driver behavior modeling system that can be applied to new state scenarios to obtain their corresponding actions, greatly improving the generalization ability of the resulting driver behavior model and making it applicable to a wider range of scenarios and more robust.
To achieve the above object, the technical gist of the present invention is a driver behavior modeling system, specifically comprising:
a feature extractor, which extracts the features used to construct the reward function;
a reward function generator, which obtains the reward function needed to construct the driving strategy;
a driving strategy acquirer, which completes construction of the driving strategy;
a determiner, which judges whether the optimal driving strategy constructed by the acquirer satisfies the evaluation criterion; if it does not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is satisfied, so that a driving strategy describing the real driving demonstration is finally obtained.
Further, the specific process by which the feature extractor extracts the features used to construct the reward function is:
S11. While the vehicle is being driven, sample the driving video captured by a camera placed behind the windshield to obtain N groups of images of road conditions in different driving environments; together with the corresponding driving operation data, i.e. the steering angle in each road environment, these jointly form the training data.
S12. Translate, crop and change the brightness of the collected images to simulate scenes with different lighting and weather.
S13. Build a convolutional neural network, take the processed images as input and the operation data of the corresponding images as label values, and train it; the weight parameters of the network are optimized by minimizing the mean-squared-error loss with a Nadam-based optimizer.
S14. Save the network structure and weights of the trained convolutional neural network and use them to build a new convolutional neural network, completing the state feature extractor.
Further, the convolutional neural network built in step S13 comprises one input layer, three convolutional layers, three pooling layers and four fully connected layers; the input layer connects in turn to the first convolutional layer and first pooling layer, then to the second convolutional layer and second pooling layer, then to the third convolutional layer and third pooling layer, and finally to the first, second, third and fourth fully connected layers in sequence.
Further, the trained convolutional neural network of step S14 does not include the output layer.
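As an illustration of steps S11 to S14, a minimal Keras sketch is given below. The filter counts, kernel sizes, dense-layer widths and the input resolution are assumptions made for the sake of a runnable example; the patent fixes only the layer types and counts, the Nadam optimizer and the MSE loss.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_steering_cnn(input_shape=(66, 200, 3)):
    """Sketch of the S13 network: 1 input, 3 conv + 3 pooling, 4 fully connected layers."""
    inp = layers.Input(shape=input_shape)                       # input layer
    x = inp
    for filters in (24, 36, 48):                                # three conv/pooling pairs (sizes assumed)
        x = layers.Conv2D(filters, 5, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    for units in (100, 50, 10):                                 # first three fully connected layers
        x = layers.Dense(units, activation="relu")(x)
    out = layers.Dense(1)(x)                                    # fourth fully connected layer: steering angle
    model = models.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Nadam(), loss="mse")  # S13: Nadam + MSE loss
    return model

def to_feature_extractor(trained):
    """S14: reuse everything except the final output layer as the state feature extractor."""
    return models.Model(trained.input, trained.layers[-2].output)
```

The feature extractor then maps a driving-scene image to the activations of the last hidden layer, which is what the features f(s_t, a_t) refer to in the steps that follow.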
Further, the specific process by which the reward function generator obtains the reward function used for the driving strategy is:
S21. Obtain expert driving demonstration data. The driving demonstration data come from sampling demonstration driving videos: a continuous driving video is sampled at a fixed frequency to obtain a set of demonstration trajectories. One set of expert demonstration data contains multiple trajectories and is denoted collectively as D_E.
Here D_E denotes the overall driving demonstration data; (s_j, a_j) denotes the data pair formed by state j and the decision command corresponding to that state; M is the total number of driving demonstration data points; N_T is the number of demonstration trajectories; and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory.
S22. Compute the feature expectation of the driving demonstration.
First, each state s_t in the demonstration data D_E that describes the driving environment is fed into the state feature extractor to obtain the features f(s_t, a_t) of that state; f(s_t, a_t) denotes a set of feature values of the driving environment scene corresponding to s_t that influence the driving decision. The feature expectation of the driving demonstration is then computed by accumulating these features, discounted by a factor γ:
where γ is the discount factor, set according to the specific problem.
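A minimal NumPy sketch of this computation is shown below. The averaging over the N_T trajectories and the helper name `feature_expectation` are assumptions; the text specifies only that the features are accumulated with discount factor γ.

```python
import numpy as np

def feature_expectation(trajectories, extract_features, gamma=0.65):
    """Discounted feature expectation of the demonstration set D_E.

    trajectories: list of trajectories, each a list of (state, action) pairs.
    extract_features: the state feature extractor from S14, mapping a state to a 1-D feature vector.
    gamma: discount factor (0.65 is the reference value given in the embodiment).
    """
    mu = None
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            f = np.asarray(extract_features(s), dtype=float)
            mu = (gamma ** t) * f if mu is None else mu + (gamma ** t) * f
    return mu / len(trajectories)   # averaging over trajectories is an assumption
```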
S23. Obtain the state-action set under the greedy policy.
S24. Compute the weights of the reward function.
More specifically, the state-action set under the greedy policy is obtained as follows. The reward function generator and the driving strategy acquirer are the two halves of one loop. First, obtain the neural network inside the driving strategy acquirer: feed the state features f(s_t, a_t) describing the environment, extracted from the demonstration data D_E, into the neural network to obtain the output g_w(s_t). g_w(s_t) is the set of Q values describing state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where Q(s_t, a_i) is the state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it is obtained from the formula Q(s, a) = θ·μ(s, a), in which θ denotes the weights of the current reward function and μ(s, a) the feature expectation.
Then, based on the ε-greedy policy, the driving decision action corresponding to the driving scene state s_t is selected: the decision action that maximizes the Q value in the Q-value set for the current driving scene s_t is chosen, or otherwise an action is chosen at random; after the selection, the Q value of the chosen pair is recorded.
In this way, the state features f(s_t, a_t) of every state in the demonstration D_E are fed into the neural network, yielding in total M state-action pairs (s_t, a_t), each describing the driving decision action a_t selected in the driving scene state s_t at time t; at the same time, based on the selected actions, the Q values of the M corresponding state-action pairs are obtained and recorded as Q.
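A small NumPy sketch of this selection step follows. Taking the greedy action with probability 1 − ε, and the names `policy_net` and `greedy_state_actions`, are assumptions; the text states only that the maximizing action is chosen with some probability and a random action otherwise.

```python
import numpy as np

def greedy_state_actions(demo_features, policy_net, epsilon=0.5, rng=np.random.default_rng(0)):
    """S23: epsilon-greedy choice over the Q values g_w(s_t) for each demonstration state.

    demo_features: list of feature vectors f(s_t, a_t), one per demonstration state.
    policy_net: callable returning the Q-value vector [Q(s_t, a_1), ..., Q(s_t, a_n)].
    Returns the chosen (time index, action index) pairs and their recorded Q values.
    """
    pairs, q_values = [], []
    for t, f_t in enumerate(demo_features):
        q = np.asarray(policy_net(f_t))
        if rng.random() > epsilon:              # greedy pick (the probability split is an assumption)
            a_t = int(np.argmax(q))
        else:                                   # otherwise a random action
            a_t = int(rng.integers(len(q)))
        pairs.append((t, a_t))
        q_values.append(q[a_t])                 # record Q(s_t, a_t) for S24 and S31
    return pairs, np.asarray(q_values)
```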
More specifically, the weights of the reward function are computed as follows:
First, construct the objective function from the following components:
the loss term, which is 0 if the current state-action pair appears in the driving demonstration and 1 otherwise; the corresponding state-action values recorded above; the product of the demonstration feature expectation computed in S22 and the reward-function weights θ; and a regularization term.
This objective function is minimized by gradient descent, i.e. t = min_θ J(θ); the variable θ that minimizes it is the required weight vector of the reward function.
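The exact formula for J(θ) is not reproduced in the text, only its four components. Assuming a margin-style combination of those components, a gradient-descent sketch could look like this; `margin_objective`, `minimize_theta` and the way the terms are combined are assumptions.

```python
import numpy as np

def margin_objective(theta, q_hat, in_demo, mu_E, lam=0.9):
    """Hedged form of J(theta): 0/1 loss plus recorded Q values, compared against theta . mu_E,
    plus an L2 regularizer. How the four named components are combined is an assumption."""
    loss = 1.0 - in_demo                          # 0 if the pair occurs in the demonstration, else 1
    margin = (q_hat + loss) - theta @ mu_E        # recorded values vs. theta . mu_E
    return margin.mean() + 0.5 * lam * (theta @ theta)

def minimize_theta(objective, dim, lr=0.01, steps=500, eps=1e-5):
    """S24 sketch: plain gradient descent, here with a finite-difference gradient."""
    theta = np.zeros(dim)
    for _ in range(steps):
        grad = np.array([(objective(theta + eps * e) - objective(theta - eps * e)) / (2 * eps)
                         for e in np.eye(dim)])
        theta -= lr * grad
    return theta
```

The embodiment sets the regularization coefficient (also written γ there) to 0.9, which is what `lam` mirrors; the final objective value plays the role of t in the determiner.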
More specifically, the process carried out by the reward function generator further includes: S25. Based on the obtained reward-function weights θ, build the reward function generator according to the formula r(s, a) = θ^T f(s, a).
As a further refinement, the driving strategy acquirer completes construction of the driving strategy as follows:
S31. Construct the training data for the driving strategy acquirer.
Obtain the training data; each datum consists of two parts: the driving decision features f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor, and a target value computed from the following quantities:
r_θ(s_t, a_t), the reward function generated by the reward function generator from the driving demonstration data; and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), taken from the Q values recorded in S23 as the Q value describing the driving scene s_t at time t and the Q value describing the driving scene s_{t+1} at time t+1.
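The formula for the target itself is not reproduced in the text. Assuming a temporal-difference style target built from the quantities above, the S31 data could be assembled roughly as follows; the exact combination is an assumption.

```python
def build_training_data(features, rewards, q_recorded, gamma=0.65):
    """S31 sketch: pair each state's decision features f(s_t) with a target value.

    The target r_theta(s_t, a_t) + gamma * Q_pi(s_{t+1}, a_{t+1}) is an assumption; the text
    only names r_theta, Q_pi(s_t, a_t) and Q_pi(s_{t+1}, a_{t+1}) as its ingredients.
    """
    data = []
    for t in range(len(features) - 1):
        target = rewards[t] + gamma * q_recorded[t + 1]
        data.append((features[t], target))
    return data
```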
S32. Build the neural network.
The neural network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene features f(s_t, a_t). The second, hidden layer has 10 neurons, and the number of neurons in the third layer equals the number n of driving actions available for decision in the action space. The activation function of both the input layer and the hidden layer is the sigmoid function, so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) is the weight matrix of the hidden layer; f_t is the feature vector of the driving scene state s_t at time t, i.e. the input of the neural network; z is the layer output before the hidden-layer sigmoid activation is applied; h is the hidden-layer output after the sigmoid activation; and w^(2) is the weight matrix of the output layer.
The network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the Q^π(s_t, a_t) in S31 is obtained by feeding the state s_t into the neural network and selecting the a_t entry of the output.
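A direct NumPy transcription of this forward pass, with the bias handled through the prepended 1 as in the formulas above, might look as follows; the class name `PolicyNet` and the random initialization are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class PolicyNet:
    """Three-layer network of S32: k inputs -> 10 hidden sigmoid units -> n Q values."""

    def __init__(self, k, n, hidden=10, rng=np.random.default_rng(0)):
        self.w1 = rng.normal(scale=0.1, size=(hidden, k + 1))   # w^(1), extra column for the bias term
        self.w2 = rng.normal(scale=0.1, size=(n, hidden + 1))   # w^(2)

    def __call__(self, f_t):
        z = self.w1 @ np.concatenate(([1.0], f_t))              # z = w^(1) [1, f_t]^T
        h = sigmoid(z)                                          # hidden-layer output
        return sigmoid(self.w2 @ np.concatenate(([1.0], h)))    # g_w(s_t) = [Q(s_t,a_1),...,Q(s_t,a_n)]
```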
S33. Optimize the neural network.
To optimize this network, the loss function used is the cross-entropy cost function, built from the following quantities:
N, the number of training data; Q^π(s_t, a_t), the value obtained by feeding the driving scene state s_t at time t into the neural network and selecting the entry for the corresponding driving decision action a_t; the target value computed in S31; and a regularization term, where W = {w^(1), w^(2)} denotes the weights of the above neural network.
Feed the training data obtained in S31 into this cost function for the network, and minimize the cross-entropy cost by gradient descent; the optimized neural network is obtained, and with it the driving strategy acquirer.
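Reusing the `PolicyNet` sketch above, and assuming the cost compares Q^π(s_t, a_t) with the S31 target through a cross-entropy term plus an L2 penalty on W (the formula itself is not reproduced in the text), the S33 cost could be written as:

```python
import numpy as np

def cross_entropy_cost(net, batch, actions, lam=0.9):
    """S33 cost sketch: cross-entropy between Q_pi(s_t, a_t) and the S31 target y_t,
    plus an L2 regularizer on W = {w1, w2}. The exact formula is an assumption."""
    eps = 1e-7
    total = 0.0
    for (f_t, y_t), a_t in zip(batch, actions):
        q = np.clip(net(f_t)[a_t], eps, 1 - eps)     # Q_pi(s_t, a_t) from the policy network
        y = np.clip(y_t, eps, 1 - eps)               # target from S31
        total -= y * np.log(q) + (1 - y) * np.log(1 - q)
    reg = 0.5 * lam * (np.sum(net.w1 ** 2) + np.sum(net.w2 ** 2))
    return total / len(batch) + reg
```

The weights W would then be updated by gradient descent on this cost, just as θ was updated in S24.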
As a further refinement, the specific process implemented by the determiner includes the following.
Treat the current reward function generator and driving strategy acquirer as a whole and check whether the value t obtained when minimizing the objective function in S24 satisfies t < ε, where ε is the threshold for judging whether the objective function meets the requirement, that is, whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs.
When the value of t does not satisfy this inequality, the reward function generator must be rebuilt: the neural network required in S23, i.e. the network used to generate the values Q(s_t, a_i) describing how good the selected decision driving action a_i is in the driving scene state s_t, is replaced with the new network structure that has already been optimized by gradient descent in S33; the reward function generator is then rebuilt, the driving strategy acquirer is obtained again, and it is judged once more whether the value of t meets the requirement.
When the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator then meets the requirement, and so does the driving strategy acquirer. Driving data of the driver for whom a model is to be built are then collected, namely the environment scene images during driving and the corresponding operation data, and fed into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are fed into the reward function generator to obtain the reward function for the corresponding scene state; finally, the collected decision features and the computed reward function are fed into the driving strategy acquirer to obtain that driver's driving strategy.
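Put together, the alternation driven by the determiner can be sketched as the loop below; the callables stand for the procedures of S23/S24, S25 and S31 to S33 described above, and the threshold name `eps_stop` is an assumption.

```python
def fit_driver_model(demo_features, mu_E, init_net, fit_theta, objective, fit_policy, eps_stop=1e-3):
    """Outer loop of the system: alternate reward-weight fitting and policy optimization
    until the determiner's criterion t < eps_stop is met."""
    net = init_net
    while True:
        theta = fit_theta(net, demo_features, mu_E)        # S23 + S24: new reward-function weights
        t = objective(theta, net, demo_features, mu_E)     # value minimized in S24
        if t < eps_stop:                                   # determiner: criterion satisfied
            return theta, net                              # reward weights and driving strategy
        rewards = [theta @ f for f in demo_features]       # S25: r(s, a) = theta^T f(s, a)
        net = fit_policy(net, demo_features, rewards)      # S31 to S33: retrain the policy network
```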
Compared with the prior art, the present invention has the following beneficial effects. Because the method used here to describe driver decisions and build the driver behavior model represents the policy with a neural network, states and actions are in one-to-one correspondence once the network parameters are fixed, so the possible state-action pairs are no longer limited to the demonstration trajectories. In real driving, the large state space created by the variety of driving scenes due to weather, scenery and so on can, thanks to the neural network's excellent ability to approximate arbitrary functions, be handled by treating this policy representation approximately as a black box: the feature values of a state are fed in, the corresponding state-action values are output, and the action is then selected according to the output values. This greatly enhances the applicability of inverse reinforcement learning to driver behavior modeling. Traditional methods try to fit the demonstration trajectories with some probability distribution, so the optimal policy they obtain remains restricted to the states already present in the demonstrations, whereas the present invention can be applied to new state scenarios to obtain their corresponding actions, greatly improving the generalization ability of the established driver behavior model, with wider applicable scenarios and stronger robustness.
Brief Description of the Drawings
Fig. 1 shows the new deep convolutional neural network;
Fig. 2 is a sample image from the driving video;
Fig. 3 is a block diagram of the workflow of the system;
Fig. 4 is a structural diagram of the neural network built in step S32.
Detailed Description of the Embodiments
The present invention is further described below with reference to the accompanying drawings. The following embodiment is intended only to illustrate the technical solution of the present invention more clearly and does not limit its scope of protection.
This embodiment provides a driver behavior modeling system, comprising:
1. A feature extractor, which extracts the features used to construct the reward function, specifically as follows.
S11. Sample the driving video obtained during driving by a camera placed behind the vehicle's windshield; a sample image is shown in Fig. 2.
Obtain N groups of images of road conditions in different driving road environments together with the corresponding steering angles, comprising N1 straight-road images and N2 curve images, where N1 and N2 may be chosen with N1 >= 300 and N2 >= 3000; together with the corresponding driving operation data, these jointly form the training data.
S12. Apply translation, cropping, brightness changes and related operations to the collected images to simulate scenes with different lighting and weather.
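A sketch of such an augmentation step is shown below; the shift, crop and brightness ranges are illustrative values, not values given in the patent.

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """S12 sketch: random horizontal shift, crop and brightness change on an HxWx3 uint8 image."""
    h, w, _ = image.shape
    shift = int(rng.integers(-20, 21))                    # translate horizontally by up to 20 px (assumed range)
    shifted = np.roll(image, shift, axis=1)
    top, left = rng.integers(0, 11, size=2)               # crop up to 10 px from the top/left (assumed range)
    cropped = shifted[top:h - 10 + top, left:w - 10 + left]
    gain = rng.uniform(0.6, 1.4)                          # brightness change (assumed range)
    return np.clip(cropped.astype(np.float32) * gain, 0, 255).astype(np.uint8)
```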
S13. Build a convolutional neural network, take the processed images as input and the operation data of the corresponding images as label values, and train it; the weight parameters of the network are optimized by minimizing the mean-squared-error loss with a Nadam-based optimizer.
The convolutional neural network comprises one input layer, three convolutional layers, three pooling layers and four fully connected layers. The input layer connects in turn to the first convolutional layer and first pooling layer, then to the second convolutional layer and second pooling layer, then to the third convolutional layer and third pooling layer, and then to the first, second, third and fourth fully connected layers in sequence.
S14. Save the network structure and weights of the trained convolutional neural network, except for the final output layer, and use them to build a new convolutional neural network, completing the state feature extractor.
2. A reward function generator, which obtains the reward function used for the driving strategy, specifically as follows.
The reward function is the criterion for action selection in reinforcement learning. In the process of acquiring a driving strategy, the quality of the reward function plays a decisive role: it directly determines the quality of the acquired driving strategy and whether the acquired strategy matches the strategy underlying the real driving demonstration data. The reward function has the form reward = θ^T f(s_t, a_t), where f(s_t, a_t) denotes a set of feature values, influencing the driving decision, of the state s_t at time t in the corresponding driving environment scene (the vehicle's surroundings), used to describe the scene around the vehicle. θ denotes the set of weights of the features that influence driving decisions; the value of each weight indicates the proportion of the corresponding environmental feature in the reward function and thus its importance. On top of the state feature extractor, this weight vector θ must be solved for in order to construct the reward function that shapes the driving strategy.
S21. Obtain expert driving demonstration data.
The driving demonstration data come from sampling demonstration driving videos (different data from those used for the driving environment feature extractor above); a continuous driving video can be sampled at 10 Hz to obtain a set of demonstration trajectories. One expert demonstration should contain multiple trajectories, denoted collectively as D_E, where D_E is the overall driving demonstration data; (s_j, a_j) is the data pair formed by state j (the video image of the driving environment at sampling time j) and the decision command corresponding to that state (such as the steering angle of a steering command); M is the total number of driving demonstration data points; N_T is the number of demonstration trajectories; and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory.
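A sketch of the 10 Hz sampling step using OpenCV is given below; OpenCV itself and the function name `sample_demo_video` are assumptions, since the patent does not name a video library.

```python
import cv2  # assumed dependency; the patent does not name a video library

def sample_demo_video(path, target_hz=10):
    """S21 sketch: sample a continuous driving video at roughly 10 Hz to form one trajectory."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, round(fps / target_hz))
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)   # each kept frame is a state s_j, to be paired with the logged command a_j
        idx += 1
    cap.release()
    return frames
```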
S22. Compute the feature expectation of the driving demonstration.
First, each state s_t in the demonstration data D_E that describes the driving environment is fed into the state feature extractor to obtain the features f(s_t, a_t) of that state; f(s_t, a_t) denotes a set of feature values of the driving environment scene corresponding to s_t that influence the driving decision. The feature expectation of the driving demonstration is then computed by accumulating these features, discounted by a factor γ:
where γ is the discount factor, set according to the specific problem; a reference value is 0.65.
S23. Obtain the state-action set under the greedy policy.
First, obtain the neural network in the driving strategy acquirer of S32. (Because the reward function generator and the driving strategy acquirer are two parts of one loop, at the very beginning this is the neural network just initialized in S32. As the loop proceeds, each step of the loop is: complete the construction of a reward function that influences driving decisions, then obtain the corresponding optimal driving strategy from the driving strategy acquirer based on the current reward function, and judge whether the criterion for ending the loop is satisfied; if it is not, the neural network optimized in the current S33 is put back into the reconstruction of the reward function.)
Feed the state features f(s_t, a_t) describing the environment, extracted from the demonstration data D_E, into the neural network to obtain the output g_w(s_t). g_w(s_t) is the set of Q values describing state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where Q(s_t, a_i) is the state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it can be obtained from the formula Q(s, a) = θ·μ(s, a), in which θ denotes the weights of the current reward function and μ(s, a) the feature expectation.
Then, based on the ε-greedy policy with ε set for example to 0.5, select the driving decision action corresponding to the driving scene state s_t: with fifty percent probability, the decision action that maximizes the Q value in the Q-value set for the current driving scene s_t is chosen; otherwise an action is chosen at random. After the selection, the Q value at that point is recorded.
In this way, the state features f(s_t, a_t) of every state in the demonstration D_E are fed into the neural network, yielding in total M state-action pairs (s_t, a_t), each describing the driving decision action a_t selected in the driving scene state s_t at time t. At the same time, based on the selected actions, the Q values of the M corresponding state-action pairs are obtained and recorded as Q.
S24. Compute the weights of the reward function.
First, construct the objective function from the following components:
the loss term, which is 0 if the current state-action pair appears in the driving demonstration and 1 otherwise; the corresponding state-action values recorded above; the product of the demonstration feature expectation computed in S22 and the reward-function weights θ; and a regularization term, included to guard against overfitting, whose coefficient γ may be set to 0.9.
This objective function is minimized by gradient descent, i.e. t = min_θ J(θ); the variable θ that minimizes it is the required weight vector of the reward function.
S25. Based on the obtained reward-function weights θ, build the reward function generator according to the formula r(s, a) = θ^T f(s, a).
3. A driving strategy acquirer, which completes construction of the driving strategy, specifically as follows.
S31. Construction of the training data for the driving strategy acquirer.
Obtain the training data. The data come from sampling the earlier demonstration data, but must be processed into a new set of N data items. Each item consists of two parts: the driving decision features f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor, and a target value computed from the following quantities:
r_θ(s_t, a_t), the reward function generated by the reward function generator from the driving demonstration data; and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), taken from the set of Q values Q recorded in S23 as the Q value describing the driving scene s_t at time t and the Q value describing the driving scene s_{t+1} at time t+1.
S32. Build the neural network.
The neural network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene features f(s_t, a_t). The second, hidden layer has 10 neurons, and the number of neurons in the third layer equals the number n of driving actions available for decision in the action space. The activation function of both the input layer and the hidden layer is the sigmoid function, so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) denotes the weight matrix of the hidden layer; f_t denotes the feature vector of the driving scene state s_t at time t, i.e. the input of the neural network; z denotes the layer output before the hidden-layer sigmoid activation is applied; h denotes the hidden-layer output after the sigmoid activation; and w^(2) denotes the weight matrix of the output layer. The network structure is shown in Fig. 4.
The network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the Q^π(s_t, a_t) in S31 is obtained by feeding the state s_t into the neural network and selecting the a_t entry of the output.
S33. Optimize the neural network.
To optimize this network, the loss function used is the cross-entropy cost function, built from the following quantities:
N, the number of training data; Q^π(s_t, a_t), the value obtained by feeding the driving scene state s_t at time t into the neural network and selecting the entry for the corresponding driving decision action a_t; the target value computed in S31; and a regularization term, likewise included to prevent overfitting, whose coefficient γ may also be set to 0.9, where W = {w^(1), w^(2)} denotes the weights of the above neural network.
Feed the training data obtained in S31 into this cost function for the network, and minimize the cross-entropy cost by gradient descent; the optimized neural network is obtained, giving the driving strategy acquirer.
4. A determiner, which judges whether the optimal driving strategy constructed by the acquirer satisfies the evaluation criterion; if it does not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is satisfied, so that a driving strategy describing the real driving demonstration is finally obtained.
Treat the current reward function generator and driving strategy acquirer as a whole and check whether the value t obtained when minimizing the objective function in S24 satisfies t < ε, where ε is the threshold for judging whether the objective function meets the requirement, that is, whether the reward function currently used to obtain the driving strategy is satisfactory. Its value is set differently according to specific needs.
When the value of t does not satisfy this inequality, the reward function generator must be rebuilt: the neural network required in the current S23, i.e. the network used to generate the values Q(s_t, a_i) describing how good the selected decision driving action a_i is in the driving scene state s_t, is replaced with the new network structure that has already been optimized by gradient descent in S33. The reward function generator is then rebuilt, the driving strategy acquirer is obtained again, and it is judged once more whether the value of t meets the requirement.
When the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator meets the requirement, and so does the driving strategy acquirer. One can then collect the driving data of the driver for whom a model is to be built, namely the environment scene images during driving and the corresponding operation data, such as the steering angle. These are fed into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are then fed into the reward function generator to obtain the reward function for the corresponding scene state; finally, the collected decision features and the computed reward function are fed into the driving strategy acquirer to obtain that driver's driving strategy.
In a Markov decision process, a policy must map states to their corresponding actions. For a large state space, however, it is hard to describe a definite policy over regions that have not been visited. Traditional methods also neglect this part: they merely build a probability model of the trajectory distribution from the demonstration trajectories and give no concrete policy representation for new states, that is, no concrete way of assigning the probability of taking a given action in a new state. In the present invention the policy is described by a neural network, which can approximate any function to any accuracy and has excellent generalization ability. With the state-feature representation, states not contained in the demonstration trajectories can also be represented; moreover, by feeding the corresponding state features into the neural network, the corresponding action values can be computed and the appropriate action obtained according to the policy. The problem that traditional methods cannot generalize the driving demonstration data to unvisited driving scene states is thereby solved.
The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it; any equivalent replacement or modification of the technical solution and inventive concept disclosed herein, made by a person skilled in the art within the technical scope of this disclosure, shall fall within the scope of protection of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810662040.0A CN108791302B (en) | 2018-06-25 | 2018-06-25 | Driver behavior modeling system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810662040.0A CN108791302B (en) | 2018-06-25 | 2018-06-25 | Driver behavior modeling system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108791302A true CN108791302A (en) | 2018-11-13 |
CN108791302B CN108791302B (en) | 2020-05-19 |
Family
ID=64070795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810662040.0A Active CN108791302B (en) | 2018-06-25 | 2018-06-25 | Driver behavior modeling system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108791302B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110481561A (en) * | 2019-08-06 | 2019-11-22 | 北京三快在线科技有限公司 | Automatic driving vehicle automatic control signal generation method and device |
CN111923928A (en) * | 2019-05-13 | 2020-11-13 | 长城汽车股份有限公司 | Decision making method and system for automatic vehicle |
CN112052776A (en) * | 2020-09-01 | 2020-12-08 | 中国人民解放军国防科技大学 | Method, device and computer equipment for optimizing autonomous driving behavior of unmanned vehicles |
CN112373482A (en) * | 2020-11-23 | 2021-02-19 | 浙江天行健智能科技有限公司 | Driving habit modeling method based on driving simulator |
WO2021093011A1 (en) * | 2019-11-14 | 2021-05-20 | 深圳大学 | Unmanned vehicle driving decision-making method, unmanned vehicle driving decision-making device, and unmanned vehicle |
CN112997128A (en) * | 2021-04-19 | 2021-06-18 | 华为技术有限公司 | Method, device and system for generating automatic driving scene |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103381826A (en) * | 2013-07-31 | 2013-11-06 | 中国人民解放军国防科学技术大学 | Adaptive cruise control method based on approximate policy iteration |
CN105955930A (en) * | 2016-05-06 | 2016-09-21 | 天津科技大学 | Guidance-type policy search reinforcement learning algorithm |
CN107168303A (en) * | 2017-03-16 | 2017-09-15 | 中国科学院深圳先进技术研究院 | Autopilot method and device for a vehicle |
CN107203134A (en) * | 2017-06-02 | 2017-09-26 | 浙江零跑科技有限公司 | A kind of front truck follower method based on depth convolutional neural networks |
CN107229973A (en) * | 2017-05-12 | 2017-10-03 | 中国科学院深圳先进技术研究院 | The generation method and device of a kind of tactful network model for Vehicular automatic driving |
CN107480726A (en) * | 2017-08-25 | 2017-12-15 | 电子科技大学 | A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon |
CN107679557A (en) * | 2017-09-19 | 2018-02-09 | 平安科技(深圳)有限公司 | Driving model training method, driver's recognition methods, device, equipment and medium |
CN108108657A (en) * | 2017-11-16 | 2018-06-01 | 浙江工业大学 | A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning |
- 2018
- 2018-06-25 CN CN201810662040.0A patent/CN108791302B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103381826A (en) * | 2013-07-31 | 2013-11-06 | 中国人民解放军国防科学技术大学 | Adaptive cruise control method based on approximate policy iteration |
CN105955930A (en) * | 2016-05-06 | 2016-09-21 | 天津科技大学 | Guidance-type policy search reinforcement learning algorithm |
CN107168303A (en) * | 2017-03-16 | 2017-09-15 | 中国科学院深圳先进技术研究院 | Autopilot method and device for a vehicle |
CN107229973A (en) * | 2017-05-12 | 2017-10-03 | 中国科学院深圳先进技术研究院 | The generation method and device of a kind of tactful network model for Vehicular automatic driving |
CN107203134A (en) * | 2017-06-02 | 2017-09-26 | 浙江零跑科技有限公司 | A kind of front truck follower method based on depth convolutional neural networks |
CN107480726A (en) * | 2017-08-25 | 2017-12-15 | 电子科技大学 | A kind of Scene Semantics dividing method based on full convolution and shot and long term mnemon |
CN107679557A (en) * | 2017-09-19 | 2018-02-09 | 平安科技(深圳)有限公司 | Driving model training method, driver's recognition methods, device, equipment and medium |
CN108108657A (en) * | 2017-11-16 | 2018-06-01 | 浙江工业大学 | A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning |
Non-Patent Citations (1)
Title |
---|
王勇鑫, 钱徽, 金卓军, 朱淼良: "基于轨迹分析的自主导航性能评估方法" [Performance evaluation method for autonomous navigation based on trajectory analysis], 《计算机工程》 [Computer Engineering] *
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111923928A (en) * | 2019-05-13 | 2020-11-13 | 长城汽车股份有限公司 | Decision making method and system for automatic vehicle |
CN110481561A (en) * | 2019-08-06 | 2019-11-22 | 北京三快在线科技有限公司 | Automatic driving vehicle automatic control signal generation method and device |
WO2021093011A1 (en) * | 2019-11-14 | 2021-05-20 | 深圳大学 | Unmanned vehicle driving decision-making method, unmanned vehicle driving decision-making device, and unmanned vehicle |
CN112052776A (en) * | 2020-09-01 | 2020-12-08 | 中国人民解放军国防科技大学 | Method, device and computer equipment for optimizing autonomous driving behavior of unmanned vehicles |
CN112373482A (en) * | 2020-11-23 | 2021-02-19 | 浙江天行健智能科技有限公司 | Driving habit modeling method based on driving simulator |
CN112373482B (en) * | 2020-11-23 | 2021-11-05 | 浙江天行健智能科技有限公司 | Driving habit modeling method based on driving simulator |
CN112997128A (en) * | 2021-04-19 | 2021-06-18 | 华为技术有限公司 | Method, device and system for generating automatic driving scene |
CN112997128B (en) * | 2021-04-19 | 2022-08-26 | 华为技术有限公司 | Method, device and system for generating automatic driving scene |
Also Published As
Publication number | Publication date |
---|---|
CN108791302B (en) | 2020-05-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108819948A (en) | Driving behavior modeling method based on reverse intensified learning | |
CN108791302A (en) | Driving behavior modeling | |
US11062617B2 (en) | Training system for autonomous driving control policy | |
CN108920805B (en) | Driver Behavior Modeling System with State Feature Extraction | |
CN106970615B (en) | A real-time online path planning method for deep reinforcement learning | |
CN110647839B (en) | Method, device and computer-readable storage medium for generating automatic driving strategy | |
CN110111335B (en) | Urban traffic scene semantic segmentation method and system for adaptive countermeasure learning | |
CN111931905A (en) | Graph convolution neural network model and vehicle track prediction method using same | |
CN110414365A (en) | Prediction method, system and medium of pedestrian trajectory on street crossing based on social force model | |
CN112232490A (en) | Deep simulation reinforcement learning driving strategy training method based on vision | |
CN108891421A (en) | A method of building driving strategy | |
CN110281949B (en) | A unified hierarchical decision-making method for autonomous driving | |
CN108944940B (en) | Driver behavior modeling method based on neural network | |
CN115331460A (en) | Large-scale traffic signal control method and device based on deep reinforcement learning | |
CN115376103A (en) | A Pedestrian Trajectory Prediction Method Based on Spatiotemporal Graph Attention Network | |
CN107633196A (en) | A kind of eyeball moving projection scheme based on convolutional neural networks | |
CN109919302A (en) | A kind of neural network training method and device | |
CN108921044A (en) | Driver's decision feature extracting method based on depth convolutional neural networks | |
Khan et al. | Latent space reinforcement learning for steering angle prediction | |
CN112947466B (en) | Parallel planning method and equipment for automatic driving and storage medium | |
CN106157278A (en) | Threshold image segmentation method based on improved adaptive GA-IAGA | |
CN117373235B (en) | A method for exploring the equilibrium strategy of interactive game between pedestrians and autonomous vehicles | |
CN108791308A (en) | The system for building driving strategy based on driving environment | |
Meftah et al. | Deep residual network for autonomous vehicles obstacle avoidance | |
CN118860163B (en) | Driving training man-machine interaction method and system applied to vehicle HUD |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||
OL01 | Intention to license declared | ||
EE01 | Entry into force of recordation of patent licensing contract |
Application publication date: 20181113 Assignee: Dalian Xinsai Xin Trading Co.,Ltd. Assignor: DALIAN University Contract record no.: X2024210000096 Denomination of invention: Driver behavior modeling system Granted publication date: 20200519 License type: Common License Record date: 20241209 |
|
EE01 | Entry into force of recordation of patent licensing contract |