CN115782880A - Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium - Google Patents

Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium

Info

Publication number
CN115782880A
CN115782880A
Authority
CN
China
Prior art keywords
lane
lane change
vehicle
decision
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211508728.6A
Other languages
Chinese (zh)
Inventor
黄开胜
林立
袁宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202211508728.6A
Publication of CN115782880A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The embodiment of the disclosure provides an intelligent automobile lane change decision method and device, an electronic device and a storage medium. The method comprises the following steps: constructing a virtual lane change scene by using a simulation method; constructing a decision model for determining a mapping function between states and rewards, and obtaining, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile so as to obtain a lane change decision, namely "lane change" or "no lane change"; making the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, storing the state and reward of the current lane change behavior in an experience playback pool, randomly sampling a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model once the number of state reward pairs in the pool reaches the minimum sample number, storing the new state reward pairs generated during training in the experience playback pool, and repeating the training process until the maximum training times is reached to obtain the lane change decision model. The lane change decisions generated by the method are highly safe and widely applicable.

Description

Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium
Technical Field
The disclosure belongs to the technical field of automatic driving, and particularly relates to an intelligent automobile lane change decision-making method and device, electronic equipment and a storage medium.
Background
Automatic driving mainly comprises environment perception, driving decision-making and vehicle control, and within driving decision-making the lane change decision governs one of the most common lateral maneuvers of a vehicle. Lane change decision methods fall into two categories, rule-based methods and machine-learning-based methods; the latter comprises the two branches of supervised learning and reinforcement learning, and its essence is to establish a mapping from perception information to decisions.
The rule-based method mainly takes the form of a finite state machine (FSM). It depends heavily on rules and thresholds set by engineers and adapts poorly to complex traffic environments; in addition, the rules are complex and the calibration parameters numerous. Because safety is usually the core objective, the resulting strategy tends to be overly conservative and seriously reduces traffic efficiency.
The learning-based method mainly applies supervised learning and reinforcement learning from machine learning and adapts well. However, supervised learning relies on collecting large amounts of real road data for training, which is costly and time-consuming, and suffers from safety problems such as the poor interpretability of a black box. Reinforcement learning achieves a similar effect by training an agent in a simulation environment, but it requires choosing a specific technique such as DDPG or DQN according to whether the input and output variables are discrete or continuous, the key variables of the specific problem, the application scene, the parameters, and so on. Existing schemes for safe vehicle lane change decisions are few, they do not consider lane changes under different traffic phases, and the decisions generalize poorly across different traffic densities and speeds of the lane change target lane.
Deep reinforcement learning (DRL) combines reinforcement learning with a deep neural network (DNN): the state-action reward function Q (the value function) is fitted by the DNN so that continuous state variables (corresponding to the information around the vehicle) can serve as decision inputs. Single-step reinforcement learning differs from general reinforcement learning in the amount of computation each exploration step compares: the former selects the behavior with the maximum single-step feedback, while the latter selects the behavior with the maximum multi-step feedback. At present, single-step reinforcement learning is rarely used to generate lane change decisions.
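For clarity, the distinction can be written in terms of the regression target used when fitting the Q function; the formulas below are a standard textbook formulation rather than a quotation from this disclosure:

$$ y_t^{\text{single-step}} = r_t, \qquad y_t^{\text{multi-step}} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-), $$

where γ is the discount factor and θ⁻ denotes the parameters of a target network. The single-step variant drops the bootstrapped term, so it needs neither a discount factor nor multi-step rollouts, which is what permits the simplified training procedure described later in this disclosure.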
Disclosure of Invention
The present disclosure is directed to solving at least one of the technical problems of the prior art.
Therefore, the lane change decision method for the intelligent automobile provided by the embodiment of the first aspect of the disclosure has the advantages of high training efficiency, strong adaptability, good safety robustness, strong engineering practicability, strong interpretability of training parameter adjustment, high safety and the like.
The lane change decision method for the intelligent automobile provided by the embodiment of the first aspect of the disclosure comprises the following steps:
1) Constructing a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise the lane in which the intelligent automobile is located and the target lane into which the intelligent automobile is expected to change;
2) Constructing a decision model based on a single-step deep Q neural network for determining a mapping function between states and rewards, and obtaining, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile so as to obtain a lane change decision, namely "lane change" or "no lane change";
3) Making the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, recording the state and reward of the current lane change behavior to form a current state reward pair and storing it in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches the minimum sample number, randomly sampling a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, storing the new state reward pairs generated during training in the experience playback pool, and repeating the training process until the maximum training times is reached to obtain the lane change decision model;
4) Acquiring the state of the real lane change scene in which the intelligent automobile is located, and inputting the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
In some embodiments, in step 1), the randomly generated parameters of the traffic flow include: the average speed of the lane, the traffic flow density of the lane and the center distance between two adjacent vehicles in the lane.
In some embodiments, the maximum training times is determined according to the following steps:
adding different traffic flow working conditions to the constructed virtual lane change scene; for each working condition, making the intelligent automobile perform lane change tests under different relative states between the intelligent automobile and the vehicles in the target lane; counting the number of collisions occurring after the intelligent automobile executes the lane change action; taking the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition; and taking the maximum of the total numbers over all working conditions as the maximum training times.
In some embodiments, the maximum training times is determined as follows:
311) Add different traffic flow working conditions to the constructed virtual lane change scene:
according to traffic flow theory, establish the traffic flow working conditions of the free phase, the smooth phase, the forced phase and the blocking phase in the virtual lane change scene, and respectively set the working condition parameters of each traffic phase, including the traffic flow speed, the traffic density, and the center distance between a leading vehicle and a following vehicle in the same lane;
312) Carry out a plurality of lane change tests for each traffic flow working condition, each lane change test comprising the following steps:
3121) At the initial test time, all vehicles are stationary, the center distance between adjacent leading and following vehicles in the target lane is the center distance set in step 311), and the initial value of the distance between the own vehicle and the leading vehicle in the current lane is set;
3122) Start the vehicles in the target lane and control them to reach the traffic flow speed set in step 311), the vehicles in the target lane keeping their inter-vehicle distance unchanged throughout; after a delay, start the own vehicle and control it to reach the traffic flow speed set in step 311);
3123) After the speed of the own vehicle reaches the traffic flow speed set in step 311), the leading vehicle in the own vehicle's lane starts to decelerate; when the speed of the leading vehicle drops to a first speed value, the own vehicle decelerates after a delay and follows the leading vehicle; when the speed of the leading vehicle drops to a second speed value, the own vehicle starts to judge the lane change feasibility and, once a lane change is judged feasible, executes the lane change action;
313) For each traffic flow working condition, after every batch of lane change tests, count the number of collisions occurring after the own vehicle executes the lane change action; take the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition, and take the maximum of the total numbers over all working conditions as the maximum training times.
In some embodiments, in step 3122) and step 3123), the delay time of the own vehicle is randomly set.
In some embodiments, step 3) specifically comprises the steps of:
31) Let the cycle index be j and initialize j = 1; initialize the network parameters θ of the decision model; construct a Q table and initialize the Q values in it to 0; set the maximum training times to M; construct an experience playback pool D and initialize it as an empty set; set the minimum sample number of the experience playback pool D to N_start and set the number of state reward pairs randomly drawn from the experience playback pool D each time to N; let the probability parameter of the greedy algorithm be ε_j with initial value ε_start and final value ε_end, and let the probability decay parameter of the greedy algorithm be ε_decay;
32) If j < M, execute step 33); if j = M, end the training and obtain the lane change decision model;
33) Based on the greedy algorithm, randomly generate state reward pairs and store them in the experience playback pool D, which specifically comprises the following steps:
331) In the virtual lane change scene, randomly initialize the state, initialize the lane change action of the own vehicle as "no lane change", randomly generate a traffic flow simulating human driving behavior in the target lane, and update the state s_j, the state being expressed as
s_j = [ v_e / v_e^max , v_l / v_l^max , v_f / v_f^max , d_l / d_l^max , d_f / d_f^max ]
where v_e and v_e^max are respectively the speed of the own vehicle and its normalization threshold; v_l and v_l^max are respectively the speed of the leading vehicle in the target lane and its normalization threshold; v_f and v_f^max are respectively the speed of the following vehicle in the target lane and its normalization threshold; d_l and d_l^max are respectively the clear distance between the own vehicle and the leading vehicle in the target lane and its normalization threshold; d_f and d_f^max are respectively the clear distance between the own vehicle and the following vehicle in the target lane and its normalization threshold;
332) Generate a random number rnd ∈ (0, 1). If rnd < ε_j, generate a random number rnd2 ∈ (0, 1); when rnd2 > 1/2, select the lane change action of the own vehicle as "lane change" and execute step 334); when rnd2 ≤ 1/2, select the lane change action of the own vehicle as "no lane change" and execute step 335). If rnd > ε_j, execute step 333);
333) Read the Q value Q(s_j) corresponding to the state s_j from the current Q table; if Q(s_j) > 0, select the lane change action of the own vehicle as "lane change" and execute step 334); if Q(s_j) ≤ 0, select the lane change action of the own vehicle as "no lane change" and execute step 332);
334) Make the own vehicle execute the lane change operation in the virtual lane change scene, and collect the current state s_j and the corresponding reward r_j; the reward r_j is set according to whether a collision occurs after the lane change operation:
r_j = r_s if no collision occurs after the lane change operation, and r_j = r_f if a collision occurs,
where r_s is the reward for a successful lane change and r_f is the reward for a lane change failure;
form the state reward pair (s_j, r_j) from the state s_j and the reward r_j, store it in the experience playback pool D, and execute step 335);
335) Determine whether the number of state reward pairs in the experience playback pool D has reached the minimum sample number N_start; if yes, execute step 34), otherwise execute step 35);
34) Randomly sample a plurality of state reward pairs from the experience playback pool as a batch of training samples and train the decision model, which specifically comprises the following steps:
341) Randomly sample N state reward pairs (s_i, r_i) from the experience playback pool D to form the current training sample set X, i ∈ [1, N];
342) Input each state s_i in the current training sample set X into the decision model to obtain the corresponding Q value Q(s_i), enabling the decision model to learn the mapping relationship between the states and the rewards;
343) Update the parameters.
Update the Q table as follows:
Q(s_i) ← (1 - η)·Q(s_i) + η·r_i
where η is the learning rate;
update the network parameters θ of the decision model to reduce its loss function
L(θ) = (1/N) · Σ_{i=1}^{N} ( Q(s_i; θ) - r_i )²
and update the probability parameter ε_j of the greedy algorithm according to
ε_j = ε_end + (ε_start - ε_end) · exp(-j / ε_decay);
execute step 35);
35) Let j = j + 1 and return to step 32).
In some embodiments, the lane change success reward r_s and the lane change failure reward r_f are determined as follows:
using a logistic-regression-like method, require that Q(s) = p(r_s|s)·r_s + p(r_f|s)·r_f > 0 hold exactly when p(r_s|s) > -r_f/(r_s - r_f), which yields r_f/r_s = -5; let r_s = 1 and r_f = -5, where p(r_s|s) denotes the probability of a successful lane change in state s, i.e., of obtaining the reward r_s, and p(r_f|s) denotes the probability of a lane change failure in state s, i.e., of a collision that yields the reward r_f.
The embodiment of the second aspect of the present disclosure provides an intelligent automobile lane change decision-making device, including:
the first module is configured to construct a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise the lane in which the intelligent automobile is located and the target lane into which the intelligent automobile is expected to change;
the second module is configured to construct a decision model based on a single-step deep Q neural network for determining a mapping function between states and rewards, and to obtain, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile so as to obtain a lane change decision, namely "lane change" or "no lane change";
the third module is configured to make the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, record the state and reward of the current lane change behavior to form a current state reward pair and store it in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches the minimum sample number, randomly sample a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, store the new state reward pairs generated during training in the experience playback pool, and repeat the training process until the maximum training times is reached to obtain the lane change decision model;
and the fourth module is configured to acquire the state of the real lane change scene where the intelligent automobile is located, and input the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
An embodiment of a third aspect of the present disclosure provides an electronic device, including:
at least one processor, and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions being configured to perform an intelligent vehicle lane change decision method provided in accordance with any embodiment of the first aspect of the present disclosure.
A computer-readable storage medium provided in an embodiment of a fourth aspect of the present disclosure is characterized in that the computer-readable storage medium stores computer instructions, where the computer instructions are configured to enable the computer to execute the intelligent automobile lane change decision-making method provided in any embodiment of the first aspect of the present disclosure.
The embodiment of the disclosure has the following characteristics and beneficial effects:
1. saving the development cost of the control strategy: the workload of control parameter calibration personnel is reduced, the calibration time is shortened, and the economic cost is reduced.
2. Fast training and verification and high strategy quality: compared with methods that rely on an actual vehicle collecting large amounts of road data to train neural network control parameters, the virtual environment can provide abundant simulated scene data to train the agent to output a reliable control strategy; the virtual environment constructs traffic flow scenes of multiple traffic phases that better reflect actual human-driven traffic flow, ensuring the generalization capability of the decision module; meanwhile, the own vehicle adopts an accurate dynamics model recognized by academia, so that the simulation of the lane change process is realistic and the decision model is guaranteed to adapt well to lane change processes with different speeds and trajectories.
3. The DQN algorithm is adapted to the lane change decision: a single-step reward value inspired by the logistic regression principle is adopted, and the state reached at training convergence approaches the theoretically set lane change success rate; compared with the general multi-step-reward DQN, the single-step DQN method makes the number of collisions converge rapidly under the same number of training cycles, which shows that the method is efficient and safe for lane change decision-making.
4. Easy deployment: the neural network has a simple structure and good real-time performance, can be deployed agilely in an MCU (Micro Controller Unit), and places little demand on the memory of the control chip.
Drawings
Fig. 1 is an overall flowchart of an intelligent automobile lane change decision method provided in an embodiment of a first aspect of the disclosure.
Fig. 2 is a schematic structural diagram of a decision model based on a single-step deep Q neural network constructed in a method provided by an embodiment of the first aspect of the disclosure.
Fig. 3 is a state diagram of a lane change scene in the method according to the embodiment of the first aspect of the disclosure.
Fig. 4 is a flowchart of training a decision model in a method provided by an embodiment of the first aspect of the disclosure.
Fig. 5 is a graph of the number of collisions occurring during a test to determine the maximum number of cycles in a method provided in an embodiment of the first aspect of the disclosure as a function of the number of training rounds.
Fig. 6 is a graph of the number of times of collision occurring when the single-step DQN used in the method provided in the embodiment of the first aspect of the present disclosure and the DQN used in the existing method generate a lane change decision, as a function of the number of times of training.
Fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of a third aspect of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to illustrate the application, not to limit it.
On the contrary, this application is intended to cover any alternatives, modifications and equivalents that may be included within the spirit and scope of the application as defined by the appended claims. Furthermore, in the following detailed description of the present application, certain specific details are set forth in order to provide a thorough understanding of the present application. It will be apparent to one skilled in the art that the present application may be practiced without these specific details.
Referring to fig. 1, the intelligent automobile lane change decision method provided in the embodiment of the first aspect of the disclosure comprises the following steps:
1) Constructing a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise the lane in which the intelligent automobile is located and the target lane into which the intelligent automobile is expected to change;
2) Constructing a decision model based on a single-step deep Q neural network for determining a mapping function between states and rewards, and obtaining, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile so as to obtain a lane change decision, namely "lane change" or "no lane change";
3) Making the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, recording the state and reward of the current lane change behavior to form a current state reward pair and storing it in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches the minimum sample number, randomly sampling a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, storing the new state reward pairs generated during training in the experience playback pool, and repeating the training process until the maximum training times is reached to obtain the lane change decision model; the maximum training times is determined according to the following steps: adding different traffic flow working conditions to the constructed virtual lane change scene; for each working condition, making the intelligent automobile perform lane change tests under different relative states between the intelligent automobile and the vehicles in the target lane; counting the number of collisions occurring after the intelligent automobile executes the lane change action; taking the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition; and taking the maximum of the total numbers over all working conditions as the maximum training times;
4) Acquiring the state of the real lane change scene in which the intelligent automobile is located, and inputting the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
In some embodiments, the virtual lane change scene is constructed with the CarMaker for Simulink module of the CarMaker software and is used to simulate the driving of the intelligent automobile and to generate the traffic environment. Specifically, a model with 2 lanes and an intelligent automobile is established in CarMaker. The own vehicle drives in one lane with a randomly varied initial speed, and a traffic flow is randomly generated in the other lane (namely the target lane into which the own vehicle is expected to change). The intelligent automobile senses the motion of the own vehicle and of the leading and following vehicles in the target lane in real time, reacts to it, and automatically adjusts its speed or changes lanes. Because the safety of the lane change decision mainly depends on whether the vehicle changes lanes, the speed adjustment and the lane change trajectory are determined by the default car-following and lane change model in the CarMaker software; the embodiment of the disclosure focuses only on training the intelligent lane change decision (the 0-1 decision of whether to change lanes), i.e., whether the lane change instruction is sent or not, and the specific lane change actions (such as the lane change speed and the lane change trajectory) adopted by the intelligent automobile after receiving the lane change instruction are not within the protection scope of the embodiment of the disclosure.
Further, the parameters of the traffic flow randomly generated within the lane include: the average speed of the lane (km/h), the traffic flow density of the lane (veh/km), and the center distance (m) between two adjacent vehicles in the lane.
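A minimal sketch of how such a randomly generated traffic flow configuration might be represented; the class name, field names and value ranges are illustrative assumptions, not values from the patent:

```python
import random
from dataclasses import dataclass

@dataclass
class TrafficFlowConfig:
    mean_speed_kmh: float      # average speed of the lane (km/h)
    density_veh_per_km: float  # traffic flow density of the lane (veh/km)
    center_gap_m: float        # center distance between two adjacent vehicles (m)

def random_traffic_flow() -> TrafficFlowConfig:
    # Value ranges are placeholders for illustration only.
    density = random.uniform(10.0, 60.0)
    return TrafficFlowConfig(
        mean_speed_kmh=random.uniform(20.0, 120.0),
        density_veh_per_km=density,
        center_gap_m=1000.0 / density,  # mean gap implied by the density over 1 km
    )
```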
In some embodiments, referring to fig. 2, the constructed decision model based on the single-step deep Q neural network comprises an input layer, a hidden layer and an output layer connected in sequence. The state s is input to the input layer and encoded, the features of the encoding result are extracted by the hidden layer, and the extracted features are decoded by the output layer to obtain the Q value Q(s) corresponding to the state s. Optionally, the input layer contains 5 nodes; the hidden layer consists of two fully connected layers, each containing 50 nodes with tanh activation functions; the output layer contains 1 node with a purelin (linear) activation function, where Q(s) ≥ 0 corresponds to "lane change" and Q(s) < 0 corresponds to "no lane change". The embodiment of the disclosure adopts a neural network based on single-step reinforcement learning, which greatly simplifies the problem, shortens the training time, and yields a decision model with better robustness under the same number of training iterations.
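As a concrete illustration of the layer sizes and activations just described, the following sketch builds a network of the same shape; PyTorch is an assumption made for illustration only, since the embodiment does not name a framework:

```python
# Illustrative sketch of the decision network described above (5 -> 50 -> 50 -> 1).
import torch
import torch.nn as nn

class LaneChangeQNet(nn.Module):
    def __init__(self, state_dim: int = 5, hidden: int = 50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),  # first fully connected layer, tanh activation
            nn.Linear(hidden, hidden), nn.Tanh(),     # second fully connected layer, tanh activation
            nn.Linear(hidden, 1),                     # linear ("purelin") output node: Q(s)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

# Decision rule from the text: "lane change" if Q(s) >= 0, otherwise "no lane change".
def decide(model: LaneChangeQNet, state: torch.Tensor) -> bool:
    with torch.no_grad():
        return bool(model(state).item() >= 0.0)
```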
In some embodiments, referring to fig. 4, step 3) specifically comprises the following steps:
31) Let the cycle index be j and initialize j = 1; initialize the network parameters θ of the decision model; construct a Q table and initialize the Q values in it to 0; set the maximum training times to M = 6000; construct an experience playback pool D and initialize it as an empty set; set the minimum sample number of the experience playback pool D to N_start = 200 and set the number of state reward pairs randomly drawn from the experience playback pool D each time to N = 32; let the probability parameter of the greedy algorithm be ε_j with initial value ε_start = 0.9 and final value ε_end = 0, and let the probability decay parameter of the greedy algorithm be ε_decay = 200;
32) If j < M, execute step 33); if j = M, end the training and obtain the lane change decision model;
33) Based on the greedy algorithm, randomly generate state reward pairs and store them in the experience playback pool D, which specifically comprises the following steps:
331) In the virtual lane change scene, randomly initialize the state and initialize the lane change action of the own vehicle as "no lane change". Referring to fig. 3, the virtual lane change scene constructed in the embodiment of the disclosure includes two lanes, namely the current lane in which the controlled vehicle (the own vehicle) is located and the target lane into which the own vehicle is expected to change. A traffic flow simulating human driving behavior is randomly generated in the target lane and the state s_j is updated, the state being expressed as
s_j = [ v_e / v_e^max , v_l / v_l^max , v_f / v_f^max , d_l / d_l^max , d_f / d_f^max ]
where v_e and v_e^max are respectively the speed of the own vehicle and its normalization threshold; v_l and v_l^max are respectively the speed of the leading vehicle in the target lane and its normalization threshold; v_f and v_f^max are respectively the speed of the following vehicle in the target lane and its normalization threshold; d_l and d_l^max are respectively the clear distance between the own vehicle and the leading vehicle in the target lane and its normalization threshold; d_f and d_f^max are respectively the clear distance between the own vehicle and the following vehicle in the target lane and its normalization threshold; each normalization threshold is set to a fixed constant; v_e (m/s), v_l (m/s), d_l (m), v_f (m/s) and d_f (m) are randomly generated;
332) Generate a random number rnd ∈ (0, 1). If rnd ≤ ε_j, generate a random number rnd2 ∈ (0, 1); when rnd2 > 1/2, select the lane change action of the own vehicle as "lane change" and execute step 334); when rnd2 ≤ 1/2, select the lane change action of the own vehicle as "no lane change" and execute step 335). If rnd > ε_j, execute step 333);
333) Read the Q value Q(s_j) corresponding to the state s_j from the current Q table; if Q(s_j) > 0, select the lane change action of the own vehicle as "lane change" and execute step 334); if Q(s_j) ≤ 0, select the lane change action of the own vehicle as "no lane change" and execute step 332);
334) Make the own vehicle execute the lane change operation in the virtual lane change scene, and collect the current state s_j and the corresponding reward r_j; the reward r_j is set according to whether a collision occurs after the lane change operation:
r_j = r_s if no collision occurs after the lane change operation, and r_j = r_f if a collision occurs,
where r_s is the reward for a successful lane change and r_f is the reward for a lane change failure;
form the state reward pair (s_j, r_j) from the state s_j and the reward r_j, store it in the experience playback pool D, and execute step 335);
335) Determine whether the number of state reward pairs in the experience playback pool D has reached the minimum sample number N_start; if yes, execute step 34), otherwise execute step 35);
34) Randomly sample a plurality of state reward pairs from the experience playback pool as a batch of training samples and train the decision model, which specifically comprises the following steps:
341) Randomly sample N = 32 state reward pairs (s_i, r_i) from the experience playback pool D to form the current training sample set X, i ∈ [1, N];
342) Input each state s_i in the current training sample set X into the decision model to obtain the corresponding Q value Q(s_i), enabling the decision model to learn the mapping relation between the states and the rewards;
343) Update the parameters.
Update the Q table as follows:
Q(s_i) ← (1 - η)·Q(s_i) + η·r_i
where η is the learning rate, optionally η = 0.001;
update the network parameters θ of the decision model to reduce its loss function
L(θ) = (1/N) · Σ_{i=1}^{N} ( Q(s_i; θ) - r_i )²
and update the probability parameter ε_j of the greedy algorithm according to
ε_j = ε_end + (ε_start - ε_end) · exp(-j / ε_decay);
execute step 35);
35) Let j = j + 1 and return to step 32).
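The loop of steps 31)-35) can be sketched roughly as follows. The two env_* functions are stand-ins for the CarMaker simulation, the Q-table bookkeeping and the re-exploration branch of step 333) are omitted, and the optimizer choice is an assumption; only the overall structure (ε-greedy exploration, single-step rewards, an experience playback pool with minimum size N_start = 200, and regression of Q(s) onto the reward) follows the steps above. LaneChangeQNet is the network sketched earlier.

```python
# Rough sketch of the single-step training loop of steps 31)-35); hyperparameter
# values follow step 31). Everything marked "stub" or "placeholder" is not from the patent.
import math
import random
import torch
import torch.nn.functional as F

M, N_START, N_BATCH = 6000, 200, 32
EPS_START, EPS_END, EPS_DECAY = 0.9, 0.0, 200
ETA = 0.001  # learning rate

def epsilon(j: int) -> float:
    # Assumed exponential decay of the greedy-exploration probability.
    return EPS_END + (EPS_START - EPS_END) * math.exp(-j / EPS_DECAY)

def env_random_state() -> list:
    # Stub: a randomly initialized, normalized scene state as in step 331).
    # Raw quantities and normalization thresholds (30 m/s, 100 m) are placeholders.
    v_e, v_l, v_f = (random.uniform(0, 30) for _ in range(3))
    d_l, d_f = (random.uniform(0, 100) for _ in range(2))
    return [v_e / 30.0, v_l / 30.0, v_f / 30.0, d_l / 100.0, d_f / 100.0]

def env_execute_lane_change(state) -> bool:
    # Stub: execute the lane change in the simulator and report whether a collision occurred.
    return random.random() < 0.1

model = LaneChangeQNet()
optimizer = torch.optim.SGD(model.parameters(), lr=ETA)  # optimizer choice is an assumption
pool: list = []  # experience playback pool D of (state, reward) pairs

for j in range(1, M):
    s = env_random_state()
    # epsilon-greedy choice of the lane change action (steps 332-333)
    if random.random() < epsilon(j):
        change_lane = random.random() > 0.5
    else:
        change_lane = model(torch.tensor(s, dtype=torch.float32)).item() > 0.0
    if change_lane:
        collided = env_execute_lane_change(s)
        r = -5.0 if collided else 1.0  # r_f = -5 on collision, r_s = 1 otherwise
        pool.append((s, r))
    # single-step training once the pool holds at least N_START pairs (steps 34, 341-343)
    if len(pool) >= N_START:
        batch = random.sample(pool, N_BATCH)
        states = torch.tensor([b[0] for b in batch], dtype=torch.float32)
        rewards = torch.tensor([[b[1]] for b in batch], dtype=torch.float32)
        loss = F.mse_loss(model(states), rewards)  # regress Q(s) onto the single-step reward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```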
Further, the maximum training number M in step 31) is determined according to the following steps:
311) Add different traffic flow working conditions to the constructed virtual lane change scene:
according to traffic flow theory, the traffic flow working conditions of the free phase, the smooth phase, the forced phase and the blocking phase are established in CarMaker; these four working conditions faithfully reflect different conditions of an actual road, and the parameters corresponding to the four working conditions are shown in table 1:
TABLE 1 traffic flow condition parameter table
Traffic phase | Traffic flow speed (km/h) | Traffic density (veh/km) | Center distance between adjacent vehicles (m)
Free phase | … | … | …
Smooth phase | … | … | …
Forced phase | … | … | …
Blocking phase | … | … | …
In table 1, the traffic flow speed reflects the average speed of the traffic flow in the road, and the traffic density reflects the number of vehicles contained in each kilometer;
312) Carry out a plurality of lane change tests for each traffic flow working condition, each lane change test comprising the following steps:
3121) At the initial test time, all vehicles are stationary, the center distance between adjacent leading and following vehicles in the target lane is as given in table 1, and the initial distance between the own vehicle and the leading vehicle in the current lane is 5 m;
3122) Start the vehicles in the target lane; they are controlled by the default car-following and lane change model in the CarMaker software to reach the target speed, namely the traffic flow speed set in table 1, and keep their inter-vehicle distance unchanged throughout; after a delay (each lane change test uses a random delay time so that the own vehicle and the vehicles in the target lane are in different relative states in different tests), the own vehicle is started and controlled by the default car-following and lane change model in the CarMaker software to reach the target speed, namely the traffic flow speed set in table 1;
3123) After the speed of the own vehicle reaches the traffic flow speed, the leading vehicle in the own vehicle's lane starts to decelerate to 40% of the traffic flow speed; the own vehicle decelerates after a randomly set delay time and follows the leading vehicle; once the speed of the leading vehicle falls below 50% of the traffic flow speed, the own vehicle starts to judge the lane change feasibility and, once a lane change is judged feasible, executes the lane change action;
313) For each traffic flow working condition, count the number of collisions occurring after the own vehicle executes the lane change action in every 100 lane change tests; take the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition, and take the maximum of the total numbers over all working conditions as the maximum training times M.
According to an embodiment of the present disclosure, referring to fig. 5, when the average total cycle number (also called the number of training rounds) under each traffic phase exceeds 3000, the number of collisions occurring in every 100 lane change decisions is guaranteed to converge to approximately 0; under extreme conditions (such as the initial conditions of the blocking phase), 6000 cycles likewise ensure that the number of collisions is 0, so M is taken to be 6000. With the method for determining the maximum training times M provided by the embodiment of the disclosure, the generated lane change decision remains highly safe even under extreme congestion conditions, and collisions are strictly avoided without consuming more training time than a general method.
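A rough sketch of the convergence check used to pick M, under the assumption that collisions are tallied in blocks of 100 lane change tests per traffic phase; the function names are illustrative:

```python
def total_tests_until_safe(run_lane_change_test, block: int = 100, max_tests: int = 10000) -> int:
    """Run lane change tests for one traffic phase until a full block of `block`
    tests produces zero collisions; return the total number of tests used."""
    total = 0
    while total < max_tests:
        # run_lane_change_test() is assumed to return 1 if a collision occurred, else 0
        collisions = sum(run_lane_change_test() for _ in range(block))
        total += block
        if collisions == 0:
            return total
    return max_tests

# M is then the maximum of the per-phase totals, as described above:
# M = max(total_tests_until_safe(test_fn) for test_fn in phase_test_functions)
```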
In some embodiments, the lane change success reward r_s and the lane change failure reward r_f are determined according to the following steps:
because the Q value in single-step reinforcement learning amounts to an average (a frequency expectation) of the single-step reward values r_s and r_f, and because, according to the literature, an average lane change success rate of 80% can already be regarded as an effective lane change decision model, the embodiment of the disclosure adopts a logistic-regression-like method: requiring that Q(s) = p(r_s|s)·r_s + p(r_f|s)·r_f > 0 hold exactly when p(r_s|s) > -r_f/(r_s - r_f) yields r_f/r_s = -5, where p(r_s|s) denotes the probability that the lane change succeeds in state s, i.e., that the reward r_s is obtained, and p(r_f|s) denotes the probability that the lane change fails in state s, i.e., that a collision occurs and the reward r_f is obtained. Preferably, r_s = 1 and r_f = -5.
The values of r_s and r_f provided according to the present disclosure ensure that the lane change strategy obtained by the reinforcement learning of the agent converges to within the expected lane change success rate, and that the total number of cycles required for convergence remains within a reasonable range.
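Rearranging the sign condition above makes the role of the reward ratio explicit; the link to the 80% threshold is spelled out here as an interpretation of the passage above:

$$ Q(s) = p(r_s \mid s)\, r_s + \bigl(1 - p(r_s \mid s)\bigr)\, r_f > 0 \;\Longleftrightarrow\; p(r_s \mid s) > \frac{-r_f}{r_s - r_f}. $$

With r_f / r_s = -5 (for example r_s = 1, r_f = -5) the threshold on the right becomes 5/6 ≈ 0.83, so the Q value only turns positive in states whose empirical lane change success frequency exceeds roughly the 80% level cited above.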
In order to verify the effectiveness of the intelligent automobile lane change decision method provided by the embodiment of the first aspect of the disclosure, the number of lane change collisions per 100 tests is compared, as a function of the number of training cycles, between the existing multi-step DQN and the single-step DQN adopted in this embodiment. Referring to fig. 6, under the same number of cycles the number of collisions of the multi-step DQN method does not clearly converge asymptotically to 0, so absolute safety cannot be guaranteed even with a larger number of cycles, whereas the single-step DQN method converges to 0 faster; this shows that the way the training method of this embodiment formulates the lane change decision problem is more efficient and can guarantee lane change safety.
The embodiment of the second aspect of the present disclosure provides an intelligent automobile lane change decision-making device, including:
the system comprises a first module, a second module and a third module, wherein the first module is configured to utilize a simulation method to construct a virtual lane changing scene, the virtual lane changing scene comprises lanes, intelligent automobiles and randomly generated traffic flows, and the lanes at least comprise lanes where the intelligent automobiles are located and a target lane;
a second module, configured to construct a decision model based on a single-step deep Q neural network, for determining a mapping function between the states and the rewards, and obtaining a Q table for evaluating lane change feasibility based on the mapping function, so as to obtain a lane change decision, i.e., "lane change" or "lane not change";
a third module, configured to make the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, record the state and reward of the current lane change behavior to form a current state reward pair and store it in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches the minimum sample number, randomly sample a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, store the new state reward pairs generated during training in the experience playback pool, and repeat the training process until the maximum training times is reached to obtain the lane change decision model; the maximum training times is determined according to the following steps: adding different traffic flow working conditions to the constructed virtual lane change scene; for each working condition, making the intelligent automobile perform lane change tests under different relative states between the intelligent automobile and the vehicles in the target lane; counting the number of collisions occurring after the intelligent automobile executes the lane change action; taking the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition; and taking the maximum of the total numbers over all working conditions as the maximum training times;
and the fourth module is configured to acquire the state of the real lane change scene where the intelligent automobile is located, and input the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
In order to implement the foregoing embodiment, the embodiment of the present disclosure further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor, and is used to execute the intelligent automobile lane change decision method of the foregoing embodiment.
Referring now to FIG. 7, shown is a schematic diagram of an electronic device suitable for use in implementing embodiments of the present disclosure. It should be noted that the electronic devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, servers, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device may include a processing apparatus (e.g., a central processing unit, a graphic processor, etc.) 101, which may perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 102 or a program loaded from a storage apparatus 108 into a Random Access Memory (RAM) 103. In the RAM103, various programs and data necessary for the operation of the electronic apparatus are also stored. The processing device 101, the ROM102, and the RAM103 are connected to each other via a bus 104. An input/output (I/O) interface 105 is also connected to bus 104.
Generally, the following devices may be connected to the I/O interface 105: input devices 106 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, etc.; an output device 107 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 108 including, for example, magnetic tape, hard disk, etc.; and a communication device 109. The communication means 109 may allow the electronic device to communicate wirelessly or by wire with other devices to exchange data. While fig. 7 illustrates an electronic device having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, the present embodiments include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 109, or installed from the storage means 108, or installed from the ROM 102. The computer program, when executed by the processing device 101, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: construct a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise the lane in which the intelligent automobile is located and a target lane; construct a decision model based on a single-step deep Q neural network for determining a mapping function between states and rewards, and obtain, based on the mapping function, a Q table for evaluating the lane change feasibility so as to obtain a lane change decision, namely "lane change" or "no lane change"; make the intelligent automobile generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, record the state and reward of the current lane change behavior to form a current state reward pair and store it in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches the minimum sample number, randomly sample a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, store the new state reward pairs generated during training in the experience playback pool, and repeat the training process until the maximum training times is reached to obtain the lane change decision model, wherein the maximum training times is determined according to the following steps: adding different traffic flow working conditions to the constructed virtual lane change scene; for each working condition, making the intelligent automobile perform lane change tests under different relative states between the intelligent automobile and the vehicles in the target lane; counting the number of collisions occurring after the intelligent automobile executes the lane change action; taking the number of lane change tests at which the number of collisions converges to approximately 0 as the total number of lane change tests of that working condition; and taking the maximum of the total numbers over all working conditions as the maximum training times; and acquire the state of the real lane change scene in which the intelligent automobile is located, and input the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk, C++ and Python, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be implemented by a program instructing associated hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. An intelligent automobile lane change decision method is characterized by comprising the following steps:
1) Constructing a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise a lane where the intelligent automobile is located and a target lane to which the intelligent automobile is expected to change;
2) Constructing a decision model based on a single-step deep Q neural network for determining a mapping function between a state and a reward, and obtaining, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile, so as to obtain a lane change decision, namely to change lane or not to change lane;
3) Enabling the intelligent automobile to generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene; recording the state and reward of the current lane change behavior to form a current state reward pair and storing the current state reward pair in an experience playback pool; when the number of state reward pairs in the experience playback pool reaches a minimum sample number, randomly sampling a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model; storing new state reward pairs generated in the training process in the experience playback pool; and continuously repeating the training process until the maximum number of training times is reached, to obtain a lane change decision model;
4) Acquiring the state of the real lane change scene where the intelligent automobile is located, and inputting the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
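By way of illustration only, the single-step deep Q neural network of step 2) could be realized as a small feed-forward network that maps the five-dimensional normalized lane change state to a single Q value, with a positive value read as "lane change feasible". The layer sizes, activation functions and the PyTorch framework in the sketch below are assumptions, not features recited in the claims.

```python
# Hypothetical sketch of the decision model of step 2); architecture details are assumed.
import torch
import torch.nn as nn

class LaneChangeQNet(nn.Module):
    """Maps the normalized 5-D state to one Q value; Q > 0 is read as 'change lane'."""
    def __init__(self, state_dim: int = 5, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Usage: evaluate lane change feasibility for one observed, already-normalized state.
model = LaneChangeQNet()
state = torch.tensor([[0.8, 0.7, 0.75, 0.6, 0.5]])  # (v_e, v_l, v_f, d_l, d_f), normalized
change_lane = model(state).item() > 0.0
```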
2. The intelligent automobile lane change decision method according to claim 1, wherein in step 1), the parameters of the randomly generated traffic flow comprise: the average speed of the lane, the traffic flow density of the lane, and the center-to-center distance between two adjacent vehicles in the lane.
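A minimal sketch of how the three traffic flow parameters of claim 2 might be drawn at random when the virtual scene is built; the numerical ranges and the coupling of the inter-vehicle distance to the density are illustrative assumptions only.

```python
# Illustrative random traffic flow generation for claim 2; all value ranges are assumed.
import random
from dataclasses import dataclass

@dataclass
class TrafficFlowParams:
    mean_speed: float   # average speed of the lane, m/s
    density: float      # traffic flow density, vehicles per km
    headway: float      # center-to-center distance between adjacent vehicles, m

def sample_traffic_flow() -> TrafficFlowParams:
    mean_speed = random.uniform(5.0, 25.0)   # assumed speed range
    density = random.uniform(10.0, 60.0)     # assumed density range
    headway = 1000.0 / density               # inter-vehicle distance implied by the density
    return TrafficFlowParams(mean_speed, density, headway)
```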
3. The intelligent automobile lane change decision method according to claim 1, wherein the maximum number of training times is determined according to the following steps:
adding different traffic flow working conditions to the constructed virtual lane change scene; for each working condition, making the intelligent automobile perform lane change tests under different relative states between the intelligent automobile and the vehicles in the target lane; counting the number of times a collision occurs after the intelligent automobile executes the lane change action; taking the number of lane change tests at which the collision count converges to approximately zero as the total number of lane change tests for that working condition; and taking the maximum of the total numbers of lane change tests over all working conditions as the maximum number of training times.
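One plausible reading of the convergence criterion in claim 3 is to run lane change tests for a working condition until a trailing window of tests contains no collisions; the window length, test budget and the run_test callback in the sketch below are assumptions.

```python
# Hypothetical convergence check for claim 3; window size and test budget are assumed.
def tests_until_collisions_converge(run_test, window: int = 200, max_tests: int = 20000) -> int:
    """run_test() performs one lane change test and returns True if a collision occurred."""
    outcomes = []
    for n in range(1, max_tests + 1):
        outcomes.append(run_test())
        if n >= window and not any(outcomes[-window:]):
            return n                      # collision count has converged to ~0
    return max_tests

# The maximum number of training times is then the maximum over all working conditions:
# max_training_times = max(tests_until_collisions_converge(t) for t in condition_tests)
```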
4. The intelligent automobile lane change decision method according to claim 1, wherein the maximum number of training times is determined according to the following steps:
311) Adding different traffic flow working conditions to the constructed virtual lane change scene:
According to the traffic flow theory, establishing the traffic flow working conditions of a free phase, a smooth phase, a forced phase and a blocking phase in a virtual lane changing scene, and respectively setting working condition parameters of each traffic phase, including the traffic flow speed, the traffic density and the central distance between a front vehicle and a rear vehicle in the same lane;
312) Carrying out a plurality of lane change tests for each traffic flow working condition, wherein each lane change test comprises the following steps:
3121) At the initial moment of the test, all vehicles are in a static state, the center-to-center distance between adjacent front and rear vehicles in the target lane is the center distance set in step 311), and an initial value of the distance between the self vehicle and the front vehicle in the self vehicle's lane is set;
3122) Starting the vehicles in the target lane and controlling them to reach the traffic flow speed set in step 311), with the vehicles in the target lane keeping their inter-vehicle distance unchanged throughout; after a delay, starting the self vehicle and controlling it to reach the traffic flow speed set in step 311);
3123) When the speed of the self vehicle reaches the traffic flow speed set in step 311), the front vehicle in the self vehicle's lane starts to decelerate; when the speed of the front vehicle falls to a first speed value, the self vehicle decelerates after a delay to follow the speed of the front vehicle; when the speed of the front vehicle falls to a second speed value, the self vehicle starts to judge lane change feasibility, and after judging that the lane change is feasible, executes the lane change action;
313) For each traffic flow working condition, after every several lane change tests, counting the number of times a collision occurs after the self vehicle executes the lane change action in those tests; taking the number of lane change tests at which the collision count converges to approximately zero as the total number of lane change tests for that working condition; and taking the maximum of the total numbers of lane change tests over all working conditions as the maximum number of training times.
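The staged test of claim 4 could be organized as below: a table of per-phase working condition parameters and one function that walks through steps 3121)–3123). Every numerical value, as well as the simulator object `sim` and all of its methods, is a hypothetical placeholder used only to make the sequence of steps concrete.

```python
# Illustrative working conditions and test sequence for claim 4; sim and all values are assumed.
PHASES = {
    "free":    {"flow_speed": 22.0, "headway": 100.0},
    "smooth":  {"flow_speed": 16.0, "headway": 40.0},
    "forced":  {"flow_speed": 10.0, "headway": 22.0},
    "blocked": {"flow_speed": 5.0,  "headway": 15.0},
}

def run_lane_change_test(sim, phase, gap_to_leader, delay_s):
    """Sketch of steps 3121)-3123); returns True if the lane change ends in a collision."""
    cfg = PHASES[phase]
    sim.reset(all_static=True, target_headway=cfg["headway"],
              ego_gap_to_leader=gap_to_leader)              # step 3121)
    sim.start_target_lane(speed=cfg["flow_speed"])          # step 3122): headway kept constant
    sim.wait(delay_s)
    sim.start_ego(speed=cfg["flow_speed"])
    sim.brake_leader(to_speed=cfg["flow_speed"] * 0.7)      # step 3123): first speed value
    sim.wait(delay_s)
    sim.ego_follow_leader()
    sim.brake_leader(to_speed=cfg["flow_speed"] * 0.4)      # second speed value
    return sim.judge_and_change_lane()                      # judge feasibility, then change
```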
5. The intelligent automobile lane change decision method according to claim 4, wherein in steps 3122) and 3123), the delay time of the self vehicle is set randomly.
6. The intelligent automobile lane change decision method according to claim 1, wherein the step 3) specifically comprises the following steps:
31) Setting a cycle counter j and initializing j = 1; initializing the network parameter θ of the decision model; constructing a Q table and initializing the Q values in the Q table to 0; setting the maximum number of training times to M; constructing an experience playback pool D and initializing it to an empty set; setting the minimum sample number of the experience playback pool D to N_start; setting the number of state reward pairs randomly sampled from the experience playback pool D each time to N; letting the probability parameter of the greedy algorithm be ε_j and setting its initial and final values to ε_start and ε_end; and letting the probability attenuation parameter of the greedy algorithm be ε_decay;
32) If j < M, executing step 33); if j = M, ending the training to obtain the lane change decision model;
33) Based on a greedy algorithm, randomly generating state reward pairs and storing them in the experience playback pool D, which specifically comprises the following steps:
331) In the virtual lane change scene, randomly initializing the state, initializing the lane change action of the self vehicle to no lane change, randomly generating in the target lane a traffic flow that simulates human driving behavior, and updating the state s_j, where the state is expressed as:

s_j = (v_e / \bar{v}_e, v_l / \bar{v}_l, v_f / \bar{v}_f, d_l / \bar{d}_l, d_f / \bar{d}_f)

where v_e and \bar{v}_e are respectively the speed of the self vehicle and its normalization threshold; v_l and \bar{v}_l are respectively the speed of the front vehicle in the target lane and its normalization threshold; v_f and \bar{v}_f are respectively the speed of the rear vehicle in the target lane and its normalization threshold; d_l and \bar{d}_l are respectively the clear distance between the self vehicle and the front vehicle in the target lane and its normalization threshold; and d_f and \bar{d}_f are respectively the clear distance between the self vehicle and the rear vehicle in the target lane and its normalization threshold;
332) Generating a random number rnd ∈ (0,1); if rnd < ε_j, generating a second random number rnd2 ∈ (0,1): when rnd2 > 1/2, selecting the lane change action of the self vehicle as lane change and executing step 334); when rnd2 ≤ 1/2, selecting the lane change action of the self vehicle as no lane change and executing step 335); if rnd > ε_j, executing step 333);
333) Reading from the current Q table the Q value Q(s_j) corresponding to the state s_j; if Q(s_j) > 0, selecting the lane change action of the self vehicle as lane change and executing step 334); if Q(s_j) ≤ 0, selecting the lane change action of the self vehicle as no lane change and executing step 332);
334) Making the self vehicle execute the lane change action in the virtual lane change scene, collecting the current state s_j and the corresponding reward r_j, and setting the reward r_j according to whether a collision occurs after the lane change action, expressed as:

r_j = r_s, if no collision occurs after the lane change; r_j = r_f, if a collision occurs after the lane change,

where r_s is the reward for a successful lane change and r_f is the reward for a failed lane change;
forming the state s_j and the reward r_j into a state reward pair (s_j, r_j), storing it in the experience playback pool D, and executing step 335);
335) Determining whether the number of state reward pairs in the experience playback pool D reaches the minimum sample number N_start; if yes, executing step 34); if not, executing step 35);
34) Randomly sampling a plurality of state reward pairs from the experience playback pool and using them as a batch of training samples to train the decision model, which specifically comprises the following steps:
341) Randomly sampling N state reward pairs (s_i, r_i) from the experience playback pool D to form a current training sample set X_i, i ∈ [1, N];
342) Inputting each state s_i in the current training sample set X_i into the decision model to obtain the corresponding Q value Q(s_i), enabling the decision model to learn the mapping relation between states and rewards;
343) Updating parameters:
updating the Q table according to the following formula:
Q(s_i) ← (1 − η)Q(s_i) + η r_i
where η is the learning rate;
updating the network parameters θ of the decision model so as to reduce the loss function of the decision model over the sampled batch:
L(θ) = (1/N) Σ_{i=1}^{N} (Q(s_i; θ) − r_i)²
updating the probability parameter ε_j of the greedy algorithm according to the following formula:
ε_j = ε_end + (ε_start − ε_end) · exp(−j / ε_decay)
Performing step 35);
35) Letting j = j + 1 and returning to step 32).
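A condensed sketch of the single-step training loop of steps 31)–35) is given below. The environment hooks (env.random_state, env.execute_lane_change), the hyper-parameter values, the exponential form of the ε update, and the discretization used to key the Q table are assumptions of the sketch, not statements of the claimed method.

```python
# Hypothetical single-step DQN training loop for claim 6; env hooks and values are assumed.
import math
import random
import torch
import torch.nn.functional as F

def train(model, env, M=20000, N_start=500, N=64,
          eps_start=0.9, eps_end=0.05, eps_decay=2000.0, eta=0.1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    replay, q_table = [], {}                            # experience pool D and the Q table
    eps = eps_start
    for j in range(1, M + 1):                           # steps 32) and 35)
        s = env.random_state()                          # step 331): random initial state
        key = tuple(round(x, 2) for x in s)             # assumed discretization for the Q table
        if random.random() < eps:                       # step 332): explore
            change = random.random() > 0.5
        else:                                           # step 333): exploit the Q table
            change = q_table.get(key, 0.0) > 0.0
        if change:                                      # step 334): act, observe r_s or r_f
            r = env.execute_lane_change(s)
            replay.append((s, r))
        if len(replay) >= N_start:                      # steps 335) and 34)
            batch = random.sample(replay, N)            # step 341)
            states = torch.tensor([b[0] for b in batch], dtype=torch.float32)
            rewards = torch.tensor([b[1] for b in batch], dtype=torch.float32)
            q_pred = model(states).squeeze(-1)          # step 342)
            loss = F.mse_loss(q_pred, rewards)          # assumed single-step regression loss
            opt.zero_grad(); loss.backward(); opt.step()
            for s_i, r_i in batch:                      # step 343): Q <- (1-eta)Q + eta*r
                k = tuple(round(x, 2) for x in s_i)
                q_table[k] = (1 - eta) * q_table.get(k, 0.0) + eta * r_i
            eps = eps_end + (eps_start - eps_end) * math.exp(-j / eps_decay)
    return model, q_table
```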
7. The intelligent automobile lane change decision method according to claim 6, wherein the lane change success reward r_s and the lane change failure reward r_f are determined according to the following steps:
By a logistic-regression-like method, requiring that Q(s) = p(r_s | s) r_s + p(r_f | s) r_f > 0 be satisfied only when p(r_s | s) > −r_f / (r_s − r_f), the ratio r_f / r_s = −5 is obtained, and r_s = 1, r_f = −5 are set, where p(r_s | s) denotes the probability of a successful lane change in state s, i.e., of obtaining the reward r_s, and p(r_f | s) denotes the probability of a lane change failure in state s, i.e., of a collision occurring and obtaining the reward r_f.
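Reading claim 7 as an expected-value argument, the reward pair r_s = 1, r_f = −5 amounts to requiring roughly a 5/6 probability of a collision-free lane change before the Q value turns positive; a short worked check under that reading:

```latex
Q(s) = p(r_s \mid s)\,r_s + p(r_f \mid s)\,r_f > 0
\;\Longleftrightarrow\;
p(r_s \mid s) > \frac{-r_f}{r_s - r_f}
\qquad\text{(using } p(r_f \mid s) = 1 - p(r_s \mid s)\text{)}.
```

With r_s = 1 and r_f = −5 this threshold is 5 / (1 + 5) = 5/6 ≈ 0.833.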
8. An intelligent automobile lane change decision device, characterized by comprising:
a first module, configured to construct a virtual lane change scene by using a simulation method, wherein the virtual lane change scene comprises lanes, an intelligent automobile and a randomly generated traffic flow, and the lanes at least comprise a lane where the intelligent automobile is located and a target lane to which the intelligent automobile is expected to change;
a second module, configured to construct a decision model based on a single-step deep Q neural network for determining a mapping function between a state and a reward, and to obtain, based on the mapping function, a Q table for evaluating the lane change feasibility of the intelligent automobile, so as to obtain a lane change decision, namely to change lane or not to change lane;
a third module, configured to enable the intelligent automobile to generate random lane change behaviors based on a greedy algorithm in the constructed virtual lane change scene, record the state and reward of the current lane change behavior to form a current state reward pair, store the current state reward pair in an experience playback pool, randomly sample, when the number of state reward pairs in the experience playback pool reaches a minimum sample number, a plurality of state reward pairs from the experience playback pool as a batch of training samples to perform single-step training on the decision model, store new state reward pairs generated in the training process in the experience playback pool, and continuously repeat the training process until the maximum number of training times is reached, to obtain a lane change decision model; and
a fourth module, configured to acquire the state of the real lane change scene where the intelligent automobile is located, and to input the state into the lane change decision model to obtain the lane change decision of the intelligent automobile.
9. An electronic device, comprising:
at least one processor, and a memory communicatively coupled to the at least one processor;
wherein the memory stores instructions executable by the at least one processor, the instructions, when executed, causing the at least one processor to perform the intelligent automobile lane change decision method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the intelligent vehicle lane-change decision method of any one of claims 1 to 7.
CN202211508728.6A 2022-11-29 2022-11-29 Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium Pending CN115782880A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211508728.6A CN115782880A (en) 2022-11-29 2022-11-29 Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211508728.6A CN115782880A (en) 2022-11-29 2022-11-29 Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115782880A true CN115782880A (en) 2023-03-14

Family

ID=85442871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211508728.6A Pending CN115782880A (en) 2022-11-29 2022-11-29 Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115782880A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116540602A (en) * 2023-04-28 2023-08-04 金陵科技学院 Vehicle unmanned method based on road section safety level DQN
CN116540602B (en) * 2023-04-28 2024-02-23 金陵科技学院 Vehicle unmanned method based on road section safety level DQN
CN116933632A (en) * 2023-07-19 2023-10-24 苏州科技大学 Entropy weight fuzzy multi-attribute lane change decision method based on multi-lane cellular automaton
CN116933632B (en) * 2023-07-19 2024-06-04 苏州科技大学 Entropy weight fuzzy multi-attribute lane change decision method based on multi-lane cellular automaton
CN118153212A (en) * 2024-05-11 2024-06-07 长春设备工艺研究所 Digital heterogeneous model generation system and method based on multi-scale fusion

Similar Documents

Publication Publication Date Title
CN111898211B (en) Intelligent vehicle speed decision method based on deep reinforcement learning and simulation method thereof
US10990096B2 (en) Reinforcement learning on autonomous vehicles
CN110834644B (en) Vehicle control method and device, vehicle to be controlled and storage medium
Wang et al. Developing a Distributed Consensus‐Based Cooperative Adaptive Cruise Control System for Heterogeneous Vehicles with Predecessor Following Topology
CN115782880A (en) Intelligent automobile lane change decision-making method and device, electronic equipment and storage medium
US11480971B2 (en) Systems and methods for generating instructions for navigating intersections with autonomous vehicles
WO2022052406A1 (en) Automatic driving training method, apparatus and device, and medium
CN112249032B (en) Automatic driving decision method, system, equipment and computer storage medium
JP2019073271A (en) Autonomous travel vehicle policy creation
US11242050B2 (en) Reinforcement learning with scene decomposition for navigating complex environments
US20210271988A1 (en) Reinforcement learning with iterative reasoning for merging in dense traffic
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
CN113158349A (en) Vehicle lane change simulation method and device, electronic equipment and storage medium
WO2023082726A1 (en) Lane changing strategy generation method and apparatus, computer storage medium, and electronic device
CN113665593A (en) Longitudinal control method and system for intelligent driving of vehicle and storage medium
CN112977412A (en) Vehicle control method, device and equipment and computer storage medium
CN111353644A (en) Prediction model generation method of intelligent network cloud platform based on reinforcement learning
CN114519433A (en) Multi-agent reinforcement learning and strategy execution method and computer equipment
CN113253612A (en) Automatic driving control method, device, equipment and readable storage medium
CN114162144B (en) Automatic driving decision method and device and electronic equipment
CN114104005B (en) Decision-making method, device and equipment of automatic driving equipment and readable storage medium
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies
CN114399107A (en) Prediction method and system of traffic state perception information
CN115719547A (en) Traffic participant trajectory prediction method and system based on multiple interactive behaviors
CN116923401A (en) Automatic driving following speed control method, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination