CN113835421B - Method and device for training driving behavior decision model


Info

Publication number
CN113835421B
CN113835421B (application number CN202010508722.3A)
Authority
CN
China
Prior art keywords
model
driving behavior
behavior decision
vehicle
information
Prior art date
Legal status
Active
Application number
CN202010508722.3A
Other languages
Chinese (zh)
Other versions
CN113835421A
Inventor
何祥坤
陈晨
刘武龙
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010508722.3A
Priority to PCT/CN2021/091964 (WO2021244207A1)
Publication of CN113835421A
Application granted
Publication of CN113835421B

Classifications

    • G — PHYSICS; G05D — Systems for controlling or regulating non-electric variables
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots; G05D1/02 control of position or course in two dimensions; G05D1/021 specially adapted to land vehicles, in particular:
    • G05D1/0214 with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221 with means for defining a desired trajectory involving a learning process
    • G05D1/0223 with means for defining a desired trajectory involving speed control of the vehicle
    • G05D1/0238 using optical position detecting means using obstacle or wall sensors
    • G05D1/024 using optical position detecting means using obstacle or wall sensors in combination with a laser
    • G05D1/0242 using optical position detecting means using non-visible light signals, e.g. IR or UV signals
    • G05D1/0246 using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0257 using a radar
    • G05D1/0263 using magnetic or electromagnetic means using magnetic strips
    • G05D1/0278 using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
    • G05D1/028 using signals provided by a source external to the vehicle using a RF signal
    • G05D1/0285 using signals provided by a source external to the vehicle using signals transmitted via a public communication network, e.g. GSM network
    • G — PHYSICS; G06N — Computing arrangements based on specific computational models; G06N3/00 based on biological models; G06N3/02 Neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Radar, Positioning & Navigation (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Electromagnetism (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Optics & Photonics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Traffic Control Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)

Abstract

The application relates to driving behavior decision technology in the autonomous driving field within the artificial intelligence field, and provides a method and a device for training a driving behavior decision model, which can be applied to intelligent vehicles in the autonomous driving field. The method comprises the following steps: making a decision using a driving behavior decision model according to the state information of a vehicle to obtain driving behavior decision information; sending the driving behavior decision information to a server; receiving a first parameter of an imitation learning model sent by the server, where the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method; and adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter. The method provided by the embodiments of the application helps improve the training efficiency of the driving behavior decision model, and the trained driving behavior decision model can output reasonable and reliable driving behavior decision information.

Description

Method and device for training driving behavior decision model
Technical Field
The present application relates to the field of autonomous driving, and more particularly, to a method and apparatus for training a driving behavior decision model.
Background
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines can perceive, reason, and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-machine interaction, recommendation and search, basic AI theory, and the like.
Autonomous driving is a mainstream application in the field of artificial intelligence. Autonomous driving technology relies on the cooperation of computer vision, radar, monitoring devices, global positioning systems, and the like, so that a motor vehicle can drive itself without active human operation. Autonomous vehicles use various computing systems to help transport passengers from one location to another. Some autonomous vehicles may require some initial or continuous input from an operator (such as a pilot, driver, or passenger). An autonomous vehicle permits the operator to switch from a manual operation mode to an autonomous mode or to a mode in between. Because autonomous driving technology does not require a human to drive the motor vehicle, it can, in theory, effectively avoid human driving errors, reduce traffic accidents, and improve road transport efficiency. Autonomous driving technology is therefore receiving more and more attention.
Driving behavior decision-making is an important component of autonomous driving technology. Specifically, it comprises selecting an action to be performed by a vehicle (e.g., acceleration, deceleration, or steering) according to the state information of the vehicle, and controlling the vehicle according to the selected action. The driving behavior decision is typically made by a driving behavior decision model, and such a model is usually obtained through reinforcement learning training. However, in the prior art, the efficiency of training a driving behavior decision model by a reinforcement learning method is low.
Disclosure of Invention
The application provides a method and a device for training a driving behavior decision model, which are beneficial to improving the training efficiency of the driving behavior decision model.
In a first aspect, a method of training a driving behavior decision model is provided, the method comprising:
using a driving behavior decision model to make a decision according to the state information of a vehicle to obtain driving behavior decision information; sending the driving behavior decision information to a server; receiving a first parameter of an imitation learning model sent by the server, where the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method; and adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
The imitation learning method is a common supervised learning method. In general, a supervised learning method can compute a loss value of a model (e.g., a driving behavior decision model) using true values (or labels) during training, and adjust the parameters of the model using the computed loss value. The learning efficiency of a supervised learning method is therefore high, and a model meeting user requirements can be obtained in a shorter time; meanwhile, because true values participate in the training process, a model trained with a supervised learning method is more reliable.
In this embodiment of the application, the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method. The imitation learning method ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
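For illustration, the vehicle-side flow of the first aspect can be sketched as follows. This is a minimal sketch in Python: DecisionModel, server.upload, and server.download_imitation_params are hypothetical placeholder names, the linear scorer is a toy stand-in for a real decision model, and the soft-update rule is only one possible way to "adjust according to the first parameter", not the application's prescribed method.

```python
import numpy as np

class DecisionModel:
    """Toy decision model: scores candidate actions with a linear function."""
    def __init__(self, dim):
        self.params = np.zeros(dim)

    def decide(self, candidate_features):
        # candidate_features: (n_actions, dim) matrix, one row per possible action.
        scores = candidate_features @ self.params
        return int(np.argmax(scores))          # index of the chosen driving behavior

def vehicle_training_step(model, candidate_features, server, tau=0.1):
    decision = model.decide(candidate_features)        # decide from state information
    server.upload(candidate_features, decision)        # send decision info to server
    first_param = server.download_imitation_params()   # first parameter from server
    # Adjust the decision model according to the first parameter
    # (soft update toward the imitation-learned parameters, an assumed rule).
    model.params = (1 - tau) * model.params + tau * first_param
```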
With reference to the first aspect, in certain implementations of the first aspect, the adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter includes: based on a reinforcement learning method, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information to obtain a second parameter; and adjusting the second parameter of the driving behavior decision model according to the first parameter.
In this embodiment of the application, the parameters of the driving behavior decision model are adjusted based on a reinforcement learning method to obtain the second parameter, and the second parameter is then adjusted according to the first parameter. The driving behavior decision model thus has both online and offline learning capability, and its learning efficiency can be further improved while the online learning capability is preserved.
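A sketch of this two-stage adjustment under toy assumptions: a linear value model updated by one temporal-difference step (yielding the second parameter), followed by a blend toward the first parameter. Neither the TD update nor the blending coefficient is prescribed by the application.

```python
import numpy as np

def rl_adjust(params, features, reward, next_value, lr=0.01, gamma=0.99):
    """One temporal-difference step; the result plays the role of the second parameter."""
    value = features @ params
    td_error = reward + gamma * next_value - value
    return params + lr * td_error * features

def adjust_with_first_param(second_param, first_param, tau=0.2):
    """Adjust the second parameter according to the first (server-trained) parameter."""
    return (1 - tau) * second_param + tau * first_param
```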
With reference to the first aspect, in certain implementations of the first aspect, the driving behavior decision model includes a first model and a second model, and the adjusting the parameters of the driving behavior decision model based on a reinforcement learning method according to the driving behavior decision information to obtain the second parameter includes: based on a reinforcement learning method, adjusting the parameters of the first model according to the driving behavior decision information to obtain the second parameter; and updating the parameters of the second model to the second parameter when a first preset condition is met, where the first preset condition is that a preset time interval has elapsed or that the parameters of the first model have been adjusted a preset number of times.
In this embodiment of the application, the parameters of the second model are updated to the second parameter only when the first preset condition is met. This avoids unstable output of the second model caused by frequent parameter adjustment, and can therefore improve the reliability of the driving behavior decision information.
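This first/second model arrangement resembles the online/target-network pattern from deep reinforcement learning. A minimal sketch, where the first preset condition is assumed to be "every sync_every adjustments of the first model":

```python
import copy

class TwoModelDecider:
    def __init__(self, first_model, sync_every=1000):
        self.first_model = first_model                   # adjusted every step
        self.second_model = copy.deepcopy(first_model)   # stable copy used for decisions
        self.sync_every = sync_every
        self.updates = 0

    def adjust_first(self, update_fn):
        update_fn(self.first_model)                      # e.g. one RL adjustment
        self.updates += 1
        if self.updates % self.sync_every == 0:          # first preset condition met
            # Update the second model's parameters to the second parameter.
            self.second_model = copy.deepcopy(self.first_model)
```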
With reference to the first aspect, in certain implementations of the first aspect, the adjusting the second parameter of the driving behavior decision model according to the first parameter includes: adjusting the parameters of the first model and/or the parameters of the second model according to the first parameter.
In this embodiment of the application, the parameters of at least one of the first model and the second model can be flexibly adjusted according to the first parameter.
With reference to the first aspect, in some implementations of the first aspect, the making a decision using a driving behavior decision model according to the state information to obtain driving behavior decision information includes: predicting the driving behavior of the vehicle at one or more later moments according to the state information, based on a dynamics model and a kinematic model of the vehicle, to obtain all possible driving behaviors at the one or more moments; and evaluating all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information.
In this embodiment of the application, making the driving behavior decision in combination with the dynamics model and the kinematic model of the vehicle can improve the rationality of the driving behavior decision information.
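A minimal sketch of this predict-then-evaluate pattern, using a kinematic bicycle model to roll candidate behaviors forward one step and a scoring function standing in for the decision model. The wheelbase, time step, discrete action set, and scorer are all illustrative assumptions; a real system would additionally apply the constraints of a dynamics model.

```python
import numpy as np

WHEELBASE = 2.7   # metres (assumed)
DT = 0.5          # prediction step, seconds (assumed)

def predict(state, accel, steer):
    """Propagate (x, y, heading, speed) one step with a kinematic bicycle model."""
    x, y, yaw, v = state
    x += v * np.cos(yaw) * DT
    y += v * np.sin(yaw) * DT
    yaw += v / WHEELBASE * np.tan(steer) * DT
    v = max(0.0, v + accel * DT)
    return np.array([x, y, yaw, v])

def decide(state, value_fn):
    """Enumerate possible behaviors at the next moment and pick the best-scoring one."""
    candidates = [(a, s) for a in (-2.0, 0.0, 2.0)   # brake / hold / accelerate (m/s^2)
                         for s in (-0.1, 0.0, 0.1)]  # steer left / straight / right (rad)
    return max(candidates, key=lambda act: value_fn(predict(state, *act)))
```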
With reference to the first aspect, in some implementations of the first aspect, in a case where the driving behavior decision model includes a first model and a second model, the evaluating all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information includes: evaluating all the possible driving behaviors using the second model to obtain the driving behavior decision information.
In this embodiment of the application, because the parameters of the first model change frequently, using the second model to make decisions can improve the reliability of the driving behavior decision information.
With reference to the first aspect, in certain implementations of the first aspect, the method further includes: receiving a third parameter of the imitation learning model sent by the server, where the third parameter is obtained after the imitation learning model is trained, based on an imitation learning method, using data output by a decision expert system, and the decision expert system is designed according to driving data of drivers and the dynamics characteristics of the vehicle; and determining initial parameters of the driving behavior decision model according to the third parameter.
In this embodiment of the application, determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent it from outputting dangerous (or unreasonable) driving behavior decision information.
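A sketch of how the third parameter might be produced: behavior cloning on (state, expert action) pairs output by the decision expert system, here reduced to an ordinary least-squares fit. The model class and fitting method are illustrative assumptions, not the application's prescribed procedure.

```python
import numpy as np

def pretrain_from_expert(expert_states, expert_actions):
    """Fit imitation-model parameters (the 'third parameter') to expert decisions.

    expert_states: (n_samples, dim) array; expert_actions: (n_samples,) array.
    """
    # Ordinary least squares: theta = argmin ||expert_states @ theta - expert_actions||^2
    theta, *_ = np.linalg.lstsq(expert_states, expert_actions, rcond=None)
    return theta   # used to initialise the driving behavior decision model
```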
With reference to the first aspect, in certain implementations of the first aspect, the first parameter is obtained after the server trains the imitation learning model, based on an imitation learning method, using the driving behavior decision information that meets a second preset condition, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
In this embodiment of the application, training the imitation learning model using reasonable driving behavior decisions corresponding to the state information can further improve the training effect of the imitation learning model, and thereby further improve the learning efficiency of the driving behavior decision model.
With reference to the first aspect, in certain implementation manners of the first aspect, the second preset condition further includes that noise of the state information is within a first preset range.
In this embodiment of the application, when the noise of the state information is within the first preset range, the driving behavior decision information obtained by making a decision based on that state information is more reasonable. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the first aspect, in certain implementations of the first aspect, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenes.
In this embodiment of the application, acquiring the state information in a plurality of scenes makes the scenes covered by the training data of the driving behavior decision model (for example, the driving behavior decision information obtained by making decisions according to the state information) richer. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the first aspect, in certain implementations of the first aspect, the second preset condition further includes: the difference between the number of pieces of state information acquired in any one of the plurality of scenes and the number of pieces of state information acquired in any other one of the plurality of scenes is within a second preset range.
In this embodiment of the application, keeping this difference within the second preset range makes the amount of training data (for example, the driving behavior decision information obtained by making decisions according to the state information) obtained in each scene more balanced. Training the imitation learning model with such driving behavior decision information ensures its training effect and prevents the driving behavior decision model from overfitting to a particular scene.
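Taken together, the second preset condition can be read as a server-side data filter. A minimal sketch, where the reasonableness test, the noise limit, and the allowed count gap between scenes are all assumed inputs:

```python
from collections import defaultdict

def filter_and_balance(samples, is_reasonable, noise_limit, count_gap):
    """Keep reasonable, low-noise samples and balance their counts across scenes.

    Each sample is assumed to be a dict with keys: state, decision, noise, scene.
    """
    by_scene = defaultdict(list)
    for s in samples:
        if is_reasonable(s["state"], s["decision"]) and s["noise"] <= noise_limit:
            by_scene[s["scene"]].append(s)
    if not by_scene:
        return []
    # Trim each scene so pairwise count differences stay within count_gap.
    cap = min(len(v) for v in by_scene.values()) + count_gap
    return [s for scene_samples in by_scene.values() for s in scene_samples[:cap]]
```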
In a second aspect, there is provided a method of training a driving behaviour decision model, the method comprising:
receiving driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle makes a decision according to the state information of the vehicle using a driving behavior decision model; training an imitation learning model according to the driving behavior decision information, based on an imitation learning method, to obtain a first parameter of the imitation learning model, where the first parameter is used to adjust the parameters of the driving behavior decision model; and sending the first parameter to the vehicle.
The imitation learning method is a common supervised learning method. In general, a supervised learning method can compute a loss value of a model (e.g., a driving behavior decision model) using true values (or labels) during training, and adjust the parameters of the model using the computed loss value. The learning efficiency of a supervised learning method is therefore high, and a model meeting user requirements can be obtained in a shorter time; meanwhile, because true values participate in the training process, a model trained with a supervised learning method is more reliable.
In this embodiment of the application, the imitation learning model is trained according to the driving behavior decision information based on an imitation learning method to obtain the first parameter of the imitation learning model. The imitation learning method ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
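A minimal sketch of the server side of the second aspect, assuming a toy linear imitation model trained by gradient descent on a squared loss against the uploaded decisions; the model class, loss, and transport calls are illustrative, not the application's prescribed procedure.

```python
import numpy as np

def train_imitation_model(theta, states, decisions, lr=0.01, epochs=10):
    """Supervised update: states (n, dim), decisions (n,) used as regression labels."""
    for _ in range(epochs):
        preds = states @ theta
        grad = states.T @ (preds - decisions) / len(decisions)  # gradient of squared loss
        theta = theta - lr * grad
    return theta   # the first parameter sent back to the vehicle

# Hypothetical usage on the server:
# first_param = train_imitation_model(theta0, uploaded_states, uploaded_decisions)
# send_to_vehicle(vehicle_id, first_param)   # placeholder transport call
```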
With reference to the second aspect, in certain implementations of the second aspect, the method further includes: training the imitation learning model, based on an imitation learning method, using data output by a decision expert system to obtain a third parameter of the imitation learning model, where the third parameter is used to determine initial parameters of the driving behavior decision model, and the decision expert system is designed according to driving data of drivers and the dynamics characteristics of the vehicle; and sending the third parameter to the vehicle.
In this embodiment of the application, determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent it from outputting dangerous (or unreasonable) driving behavior decision information.
With reference to the second aspect, in some implementations of the second aspect, the training an imitation learning model according to the driving behavior decision information based on an imitation learning method to obtain a first parameter of the imitation learning model includes: based on an imitation learning method, training the imitation learning model according to the driving behavior decision information that meets a second preset condition to obtain the first parameter of the imitation learning model, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
In this embodiment of the application, training the imitation learning model using reasonable driving behavior decisions corresponding to the state information can further improve the training effect of the imitation learning model, and thereby further improve the learning efficiency of the driving behavior decision model.
With reference to the second aspect, in certain implementation manners of the second aspect, the second preset condition further includes that noise of the state information is within a first preset range.
In this embodiment of the application, when the noise of the state information is within the first preset range, the driving behavior decision information obtained by making a decision based on that state information is more reasonable. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the second aspect, in certain implementations of the second aspect, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenes.
In this embodiment of the application, acquiring the state information in a plurality of scenes makes the scenes covered by the training data of the driving behavior decision model (for example, the driving behavior decision information obtained by making decisions according to the state information) richer. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the second aspect, in certain implementation manners of the second aspect, the second preset condition further includes: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
In this embodiment of the application, keeping this difference within the second preset range makes the amount of training data (for example, the driving behavior decision information obtained by making decisions according to the state information) obtained in each scene more balanced. Training the imitation learning model with such driving behavior decision information ensures its training effect and prevents the driving behavior decision model from overfitting to a particular scene.
In a third aspect, an apparatus for training a driving behavior decision model is provided, comprising:
a decision unit, configured to make a decision using a driving behavior decision model according to the state information of a vehicle to obtain driving behavior decision information; a transmitting unit, configured to transmit the driving behavior decision information to a server; a receiving unit, configured to receive a first parameter of an imitation learning model sent by the server, where the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method; and an adjusting unit, configured to adjust the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
The imitation learning method is a common supervised learning method. In general, a supervised learning method can compute a loss value of a model (e.g., a driving behavior decision model) using true values (or labels) during training, and adjust the parameters of the model using the computed loss value. The learning efficiency of a supervised learning method is therefore high, and a model meeting user requirements can be obtained in a shorter time; meanwhile, because true values participate in the training process, a model trained with a supervised learning method is more reliable.
In this embodiment of the application, the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method. The imitation learning method ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
With reference to the third aspect, in certain implementations of the third aspect, the adjusting unit is specifically configured to: based on a reinforcement learning method, adjust the parameters of the driving behavior decision model according to the driving behavior decision information to obtain a second parameter; and adjust the second parameter of the driving behavior decision model according to the first parameter.
In this embodiment of the application, the parameters of the driving behavior decision model are adjusted based on a reinforcement learning method to obtain the second parameter, and the second parameter is then adjusted according to the first parameter. The driving behavior decision model thus has both online and offline learning capability, and its learning efficiency can be further improved while the online learning capability is preserved.
With reference to the third aspect, in certain implementations of the third aspect, the driving behavior decision model includes a first model and a second model, and the adjusting unit is specifically configured to: based on a reinforcement learning method, adjust the parameters of the first model according to the driving behavior decision information to obtain the second parameter; and update the parameters of the second model to the second parameter when a first preset condition is met, where the first preset condition is that a preset time interval has elapsed or that the parameters of the first model have been adjusted a preset number of times.
In this embodiment of the application, the parameters of the second model are updated to the second parameter only when the first preset condition is met. This avoids unstable output of the second model caused by frequent parameter adjustment, and can therefore improve the reliability of the driving behavior decision information.
With reference to the third aspect, in certain implementations of the third aspect, the adjusting unit is specifically configured to adjust the parameters of the first model and/or the parameters of the second model according to the first parameter.
In this embodiment of the application, the parameters of at least one of the first model and the second model can be flexibly adjusted according to the first parameter.
With reference to the third aspect, in certain implementations of the third aspect, the decision unit is specifically configured to: predict the driving behavior of the vehicle at one or more later moments according to the state information, based on a dynamics model and a kinematic model of the vehicle, to obtain all possible driving behaviors at the one or more moments; and evaluate all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information.
In this embodiment of the application, making the driving behavior decision in combination with the dynamics model and the kinematic model of the vehicle can improve the rationality of the driving behavior decision information.
With reference to the third aspect, in certain implementations of the third aspect, in case the driving behavior decision model includes a first model and a second model, the decision unit is specifically configured to evaluate all the possible driving behaviors using the second model to obtain the driving behavior decision information.
In this embodiment of the application, because the parameters of the first model change frequently, using the second model to make decisions can improve the reliability of the driving behavior decision information.
With reference to the third aspect, in certain implementations of the third aspect, the receiving unit is further configured to receive a third parameter of the imitation learning model sent by the server, where the third parameter is obtained after the imitation learning model is trained, based on an imitation learning method, using data output by a decision expert system, and the decision expert system is designed according to driving data of drivers and the dynamics characteristics of the vehicle; and the adjusting unit is further configured to determine initial parameters of the driving behavior decision model according to the third parameter.
In this embodiment of the application, determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent it from outputting dangerous (or unreasonable) driving behavior decision information.
With reference to the third aspect, in some implementations of the third aspect, the first parameter is obtained after the server trains the imitation learning model, based on an imitation learning method, using the driving behavior decision information that meets a second preset condition, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
In this embodiment of the application, training the imitation learning model using reasonable driving behavior decisions corresponding to the state information can further improve the training effect of the imitation learning model, and thereby further improve the learning efficiency of the driving behavior decision model.
With reference to the third aspect, in certain implementations of the third aspect, the second preset condition further includes that noise of the state information is within a first preset range.
In this embodiment of the application, when the noise of the state information is within the first preset range, the driving behavior decision information obtained by making a decision based on that state information is more reasonable. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the third aspect, in certain implementations of the third aspect, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenes.
In this embodiment of the application, acquiring the state information in a plurality of scenes makes the scenes covered by the training data of the driving behavior decision model (for example, the driving behavior decision information obtained by making decisions according to the state information) richer. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the third aspect, in certain implementations of the third aspect, the second preset condition further includes: the difference between the number of pieces of state information acquired in any one of the plurality of scenes and the number of pieces of state information acquired in any other one of the plurality of scenes is within a second preset range.
In this embodiment of the application, keeping this difference within the second preset range makes the amount of training data (for example, the driving behavior decision information obtained by making decisions according to the state information) obtained in each scene more balanced. Training the imitation learning model with such driving behavior decision information ensures its training effect and prevents the driving behavior decision model from overfitting to a particular scene.
In a fourth aspect, an apparatus for training a driving behavior decision model is provided, including:
a receiving unit, configured to receive driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle makes a decision according to the state information of the vehicle using a driving behavior decision model; a training unit, configured to train an imitation learning model according to the driving behavior decision information, based on an imitation learning method, to obtain a first parameter of the imitation learning model, where the first parameter is used to adjust the parameters of the driving behavior decision model; and a transmitting unit, configured to transmit the first parameter to the vehicle.
The imitation learning method is a common supervised learning method. In general, a supervised learning method can compute a loss value of a model (e.g., a driving behavior decision model) using true values (or labels) during training, and adjust the parameters of the model using the computed loss value. The learning efficiency of a supervised learning method is therefore high, and a model meeting user requirements can be obtained in a shorter time; meanwhile, because true values participate in the training process, a model trained with a supervised learning method is more reliable.
In this embodiment of the application, the imitation learning model is trained according to the driving behavior decision information based on an imitation learning method to obtain the first parameter of the imitation learning model. The imitation learning method ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the training unit is further configured to train the imitation learning model, based on an imitation learning method, using data output by a decision expert system to obtain a third parameter of the imitation learning model, where the third parameter is used to determine initial parameters of the driving behavior decision model, and the decision expert system is designed according to driving data of drivers and the dynamics characteristics of the vehicle; and the transmitting unit is further configured to transmit the third parameter to the vehicle.
In this embodiment of the application, determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent it from outputting dangerous (or unreasonable) driving behavior decision information.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the training unit is specifically configured to: based on an imitation learning method, train the imitation learning model according to the driving behavior decision information that meets a second preset condition to obtain the first parameter of the imitation learning model, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
In this embodiment of the application, training the imitation learning model using reasonable driving behavior decisions corresponding to the state information can further improve the training effect of the imitation learning model, and thereby further improve the learning efficiency of the driving behavior decision model.
With reference to the fourth aspect, in certain implementation manners of the fourth aspect, the second preset condition further includes that noise of the state information is within a first preset range.
In this embodiment of the application, when the noise of the state information is within the first preset range, the driving behavior decision information obtained by making a decision based on that state information is more reasonable. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the fourth aspect, in certain implementations of the fourth aspect, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenes.
In this embodiment of the application, acquiring the state information in a plurality of scenes makes the scenes covered by the training data of the driving behavior decision model (for example, the driving behavior decision information obtained by making decisions according to the state information) richer. Training the imitation learning model with such driving behavior decision information can further improve its training effect, which helps further improve the learning efficiency of the driving behavior decision model.
With reference to the fourth aspect, in certain implementation manners of the fourth aspect, the second preset condition further includes: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
In this embodiment of the application, keeping this difference within the second preset range makes the amount of training data (for example, the driving behavior decision information obtained by making decisions according to the state information) obtained in each scene more balanced. Training the imitation learning model with such driving behavior decision information ensures its training effect and prevents the driving behavior decision model from overfitting to a particular scene.
In a fifth aspect, there is provided an apparatus for training a driving behavior decision model, the apparatus comprising a storage medium, which may be a non-volatile storage medium, having stored therein a computer executable program, and a central processor, which is connected to the non-volatile storage medium and which executes the computer executable program to implement the method in any one of the possible implementations of the first aspect.
In a sixth aspect, there is provided an apparatus for training a driving behavior decision model, the apparatus comprising a storage medium, which may be a non-volatile storage medium, having stored therein a computer executable program, and a central processor, connected to the non-volatile storage medium, and executing the computer executable program to implement the method in any one of the possible implementations of the second aspect.
In a seventh aspect, a chip is provided, the chip comprising a processor and a data interface, the processor reading instructions stored on a memory through the data interface, performing the method of any one of the possible implementations of the first aspect or any one of the possible implementations of the second aspect.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method in any possible implementation manner of the first aspect or any possible implementation manner of the second aspect.
In an eighth aspect, a computer readable storage medium is provided, the computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of any one of the possible implementations of the first aspect or any one of the possible implementations of the second aspect.
A ninth aspect provides an automobile comprising the apparatus for training a driving behavior decision model according to any one of the possible implementation manners of the third aspect or the fifth aspect.
In a tenth aspect, a server is provided, which comprises the device for training a driving behavior decision model according to any one of the possible implementation manners of the fourth aspect or the sixth aspect.
In the embodiments of the application, the first parameter is obtained after the server trains the imitation learning model using the driving behavior decision information based on an imitation learning method. The imitation learning method ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
Drawings
Fig. 1 is a schematic structural diagram of an autonomous vehicle according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of an autonomous driving system according to an embodiment of the present application;
fig. 3 is a schematic structural diagram of a neural network processor according to an embodiment of the present application;
fig. 4 is a schematic diagram of cloud-side instruction of an autonomous vehicle according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a method of training a driving behavior decision model provided by one embodiment of the present application;
FIG. 6 is a schematic block diagram of a method of training a driving behavior decision model provided by another embodiment of the present application;
FIG. 7 is a schematic block diagram of a method of training a driving behavior decision model provided by another embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of a method of training a driving behavior decision model provided by one embodiment of the application;
FIG. 9 is a schematic block diagram of a radial basis function neural network (RBFNN) provided by an embodiment of the present application;
FIG. 10 is a schematic block diagram of an apparatus for training a driving behavior decision model provided by one embodiment of the application;
FIG. 11 is a schematic block diagram of an apparatus for training a driving behavior decision model provided by another embodiment of the present application;
fig. 12 is a schematic block diagram of an apparatus for training a driving behavior decision model provided in yet another embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
The technical solutions of the embodiments of the application can be applied to various vehicles, which may specifically be diesel vehicles, intelligent electric vehicles, or hybrid vehicles, or vehicles of other power types; the embodiments of the application are not limited in this regard.
The vehicle in the embodiment of the present application may be an autonomous vehicle, for example, the autonomous vehicle may be configured with an autonomous mode, and the autonomous mode may be a full autonomous mode, or may be a partial autonomous mode, which is not limited by the embodiment of the present application.
The vehicle in the embodiment of the application can be further configured with other driving modes, which may include one or more of a sport mode, an economy mode, a standard mode, a snowfield mode, a climbing mode, and the like. The autonomous vehicle may switch between the autonomous mode and the various driving modes described above (in which the driver drives the vehicle), and the embodiment of the application is not limited in this regard.
Fig. 1 is a functional block diagram of a vehicle 100 provided by an embodiment of the present application.
In one embodiment, the vehicle 100 is configured in a fully or partially autonomous mode.
For example, while in the autonomous mode, the vehicle 100 may control itself: it may determine the current state of the vehicle and its surrounding environment, determine the possible behavior of at least one other vehicle in the surrounding environment, determine a confidence level corresponding to the likelihood that the other vehicle performs the possible behavior, and control the vehicle 100 based on the determined information. While the vehicle 100 is in the autonomous mode, it may be placed into operation without human interaction.
The vehicle 100 may include various subsystems, such as a travel system 102, a sensor system 104, a control system 106, one or more peripheral devices 108, as well as a power source 110, a computer system 112, and a user interface 116.
Alternatively, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements. In addition, each of the subsystems and elements of the vehicle 100 may be interconnected by wires or wirelessly.
The travel system 102 may include components that provide powered movement of the vehicle 100. In one embodiment, the travel system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or a combination of engine types, such as a hybrid engine of a gasoline engine and an electric motor, or a hybrid engine of an internal combustion engine and an air compression engine. The engine 118 converts the energy source 119 into mechanical energy.
Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electricity. The energy source 119 may also provide energy to other systems of the vehicle 100.
The transmission 120 may transmit mechanical power from the engine 118 to the wheels 121. The transmission 120 may include a gearbox, a differential, and a drive shaft.
In one embodiment, the transmission 120 may also include other devices, such as a clutch. Wherein the drive shaft may comprise one or more axles that may be coupled to one or more wheels 121.
The sensor system 104 may include several sensors that sense information about the environment surrounding the vehicle 100.
For example, the sensor system 104 may include a positioning system 122 (which may be a GPS system, a Beidou system, or another positioning system), an inertial measurement unit (inertial measurement unit, IMU) 124, a radar 126, a laser rangefinder 128, and a camera 130. The sensor system 104 may also include sensors that monitor the internal systems of the vehicle 100 (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors may be used to detect objects and their corresponding characteristics (position, shape, orientation, speed, etc.). Such detection and identification is a critical function for the safe operation of the autonomous vehicle 100.
The positioning system 122 may be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense changes in the position and orientation of the vehicle 100 based on inertial acceleration. In one embodiment, the IMU 124 may be a combination of an accelerometer and a gyroscope.
Radar 126 may utilize radio signals to sense objects within the surrounding environment of vehicle 100. In some embodiments, in addition to sensing an object, the radar 126 may be used to sense the speed and/or heading of the object.
The laser rangefinder 128 may utilize a laser to sense objects in the environment in which the vehicle 100 is located. In some embodiments, laser rangefinder 128 may include one or more laser sources, a laser scanner, and one or more detectors, among other system components.
The camera 130 may be used to capture a plurality of images of the surrounding environment of the vehicle 100. The camera 130 may be a still camera or a video camera.
The control system 106 is configured to control the operation of the vehicle 100 and its components. The control system 106 may include various elements including a steering system 132, a throttle 134, a brake unit 136, a sensor fusion algorithm 138, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.
The steering system 132 is operable to adjust the direction of travel of the vehicle 100. For example, in one embodiment the steering system 132 may be a steering wheel system.
The throttle 134 is used to control the operating speed of the engine 118 and thus the speed of the vehicle 100.
The brake unit 136 is used to control the vehicle 100 to decelerate. The brake unit 136 may use friction to slow the wheel 121. In other embodiments, the braking unit 136 may convert the kinetic energy of the wheels 121 into electric current. The brake unit 136 may take other forms to slow the rotational speed of the wheels 121 to control the speed of the vehicle 100.
The computer vision system 140 may be operable to process and analyze images captured by the camera 130 to identify objects and/or features in the environment surrounding the vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. The computer vision system 140 may use object recognition algorithms, in-motion restoration structure (Structure from Motion, SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map an environment, track objects, estimate the speed of objects, and so forth.
The route control system 142 is used to determine a travel route for the vehicle 100. In some embodiments, the route control system 142 may combine data from the sensor fusion algorithm 138, the positioning system 122, and one or more predetermined maps to determine the travel route for the vehicle 100.
The obstacle avoidance system 144 is operable to identify, evaluate, and avoid or otherwise overcome potential obstacles in the environment of the vehicle 100.
Of course, in one example, control system 106 may additionally or alternatively include components other than those shown and described. Or some of the components shown above may be eliminated.
The vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through the peripheral devices 108. Peripheral devices 108 may include a wireless communication system 146, a vehicle computer 148, a microphone 150, and/or a speaker 152.
In some embodiments, the peripheral device 108 provides a means for a user of the vehicle 100 to interact with the user interface 116. For example, the vehicle computer 148 may provide information to a user of the vehicle 100. The user interface 116 may also operate with the vehicle computer 148 to receive user input; the vehicle computer 148 may be operated via a touch screen. In other cases, the peripheral device 108 may provide a means for the vehicle 100 to communicate with other devices located within the vehicle. For example, the microphone 150 may receive audio (e.g., a voice command or other audio input) from a user of the vehicle 100. Similarly, the speaker 152 may output audio to a user of the vehicle 100.
The wireless communication system 146 may communicate wirelessly with one or more devices, either directly or via a communication network. For example, the wireless communication system 146 may use 3G cellular communication, such as CDMA, EVDO, or GSM/GPRS; 4G cellular communication, such as LTE; or 5G cellular communication. The wireless communication system 146 may communicate with a wireless local area network (wireless local area network, WLAN) using WiFi. In some embodiments, the wireless communication system 146 may communicate directly with a device using an infrared link, Bluetooth, or ZigBee. Other wireless protocols, such as various vehicle communication systems, may also be used; for example, the wireless communication system 146 may include one or more dedicated short range communications (dedicated short range communications, DSRC) devices, which may include public and/or private data communication between vehicles and/or roadside stations.
The power source 110 may provide power to various components of the vehicle 100. In one embodiment, the power source 110 may be a rechargeable lithium ion or lead acid battery. One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle 100. In some embodiments, the power source 110 and the energy source 119 may be implemented together, such as in some all-electric vehicles.
Some or all of the functions of the vehicle 100 are controlled by a computer system 112. The computer system 112 may include at least one processor 113, the processor 113 executing instructions 115 stored in a non-transitory computer-readable medium, such as a data storage 114. The computer system 112 may also be a plurality of computing devices that control individual components or subsystems of the vehicle 100 in a distributed manner.
The processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or another hardware-based processor. Although FIG. 1 functionally illustrates the processor, memory, and other elements of the computer system 112 in the same block, those of ordinary skill in the art will understand that the processor, computer, or memory may in fact comprise multiple processors, computers, or memories that may or may not be located within the same physical housing. For example, the memory may be a hard disk drive or other storage medium located in a housing different from that of the computer system 112. Thus, references to a processor or computer will be understood to include references to a collection of processors, computers, or memories that may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering component and the deceleration component, may each have their own processor that performs only calculations related to that component's specific function.
In various aspects described herein, the processor may be located remotely from the vehicle and in wireless communication with the vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the vehicle and others are performed by a remote processor, including taking the necessary steps to perform a single maneuver.
In some embodiments, the data storage 114 may contain instructions 115 (e.g., program logic) that the instructions 115 may be executed by the processor 113 to perform various functions of the vehicle 100, including those described above. The data storage 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the propulsion system 102, the sensor system 104, the control system 106, and the peripherals 108.
In addition to instructions 115, data storage 114 may also store data such as road maps, route information, vehicle location, direction, speed, and other such vehicle data, as well as other information. Such information may be used by the vehicle 100 and the computer system 112 during operation of the vehicle 100 in autonomous, semi-autonomous, and/or manual modes.
The user interface 116 is used to provide information to or receive information from a user of the vehicle 100. Optionally, the user interface 116 may include one or more input/output devices within the set of peripheral devices 108, such as the wireless communication system 146, the vehicle computer 148, the microphone 150, and the speaker 152.
The computer system 112 may control the functions of the vehicle 100 based on inputs received from various subsystems (e.g., the travel system 102, the sensor system 104, and the control system 106) as well as from the user interface 116. For example, the computer system 112 may utilize inputs from the control system 106 to control the steering unit 132 to avoid obstacles detected by the sensor system 104 and the obstacle avoidance system 144. In some embodiments, computer system 112 is operable to provide control over many aspects of vehicle 100 and its subsystems.
Alternatively, one or more of these components may be mounted separately from or associated with vehicle 100. For example, the data storage 114 may exist partially or completely separate from the vehicle 100. The above components may be communicatively coupled together in a wired and/or wireless manner.
Alternatively, the above components are only an example, and in practical applications, components in the above modules may be added or deleted according to actual needs, and fig. 1 should not be construed as limiting the embodiments of the present application.
An autonomous vehicle traveling on a road, such as the vehicle 100 above, may identify objects within its surrounding environment to determine adjustments to its current speed. The object may be another vehicle, a traffic control device, or another type of object. In some examples, each identified object may be considered independently, and its respective characteristics, such as its current speed, acceleration, and distance from the vehicle, may be used to determine the speed to which the autonomous vehicle is to be adjusted.
Alternatively, the vehicle 100 or a computing device associated with the vehicle 100 (e.g., the computer system 112, the computer vision system 140, or the data storage 114 of FIG. 1) may predict the behavior of an identified object based on the characteristics of the identified object and the state of the surrounding environment (e.g., traffic, rain, ice on the road, etc.). Alternatively, since the behavior of each identified object may depend on the behavior of the others, all of the identified objects may also be considered together to predict the behavior of a single identified object. The vehicle 100 is able to adjust its speed based on the predicted behavior of the identified objects. In other words, the autonomous vehicle is able to determine what stable state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the objects. In this process, other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 in the road on which it is traveling, the curvature of the road, and the proximity of static and dynamic objects.
In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device may also provide instructions to modify the steering angle of the vehicle 100 so that the autonomous vehicle follows a given trajectory and/or maintains safe lateral and longitudinal distances from objects in the vicinity of the autonomous vehicle (e.g., cars in adjacent lanes on the road).
The vehicle 100 may be a car, a truck, a motorcycle, a bus, a ship, an airplane, a helicopter, a mower, an amusement ride, a casino vehicle, construction equipment, an electric car, a golf car, a train, a trolley, or the like, and the embodiment of the present application is not particularly limited.
Fig. 2 is a schematic diagram of an autopilot system according to an embodiment of the present application.
The autopilot system shown in fig. 2 includes a computer system 101, where the computer system 101 includes a processor 103 coupled to a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A display adapter 107 may drive a display 109, and the display 109 is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus 113 through a bus bridge 111, and an I/O interface 115 is coupled to the I/O bus. The I/O interface 115 communicates with a variety of I/O devices, such as an input device 117 (e.g., a keyboard, a mouse, a touch screen, etc.), a media tray 121 (e.g., a CD-ROM, a multimedia interface, etc.), a transceiver 123 (which may transmit and/or receive radio communication signals), a camera 155 (which may capture still and moving digital video images), and an external USB interface 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.
The processor 103 may be any conventional processor including a reduced instruction set computing (reduced instruction set computer, RISC) processor, a complex instruction set computing (complex instruction set computer, CISC) processor, or a combination thereof. In the alternative, the processor may be a dedicated device such as an application specific integrated circuit (application specific integrated circuit, ASIC). Alternatively, the processor 103 may be a neural network processor or a combination of a neural network processor and the conventional processors described above.
Alternatively, in various embodiments described herein, the computer system 101 may be located remotely from the autonomous vehicle (e.g., the computer system 101 may be located at a cloud or server), and may be in wireless communication with the autonomous vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the autonomous vehicle, others are performed by a remote processor, including taking the actions required to perform a single maneuver.
The computer system 101 may communicate with a software deploying server 149 through a network interface 129. The network interface 129 is a hardware network interface, such as a network card. The network 127 may be an external network, such as the Internet, or an internal network, such as an Ethernet or a virtual private network (virtual private network, VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network or a cellular network.
A hard drive interface is coupled to the system bus 105, and the hard drive interface is in turn coupled to a hard disk drive. The system memory 135 is also coupled to the system bus 105. Data running in the system memory 135 may include the operating system 137 and application programs 143 of the computer system 101.
The operating system includes a shell 139 and a kernel 141. The shell is an interface between the user and the kernel (kernel) of the operating system, and is the outermost layer of the operating system. The shell manages the interaction between the user and the operating system: it waits for user input, interprets the user input to the operating system, and processes the output results of the operating system.
Kernel 141 is made up of those parts of the operating system that are used to manage memory, files, peripherals, and system resources. The operating system kernel typically runs processes and provides inter-process communication, CPU time slice management, interrupts, memory management, IO management, and so on, directly interacting with the hardware.
The application program 143 includes programs related to driving behavior decision, which, for example, acquire state information of a vehicle, make a decision according to the state information of the vehicle to obtain driving behavior decision information (i.e., an action to be executed by the vehicle, such as an acceleration, deceleration, or steering action), and control the vehicle according to the driving behavior decision information. The application program 143 may also exist on the system of the software deployment server 149. In one embodiment, the computer system 101 may download the application program 143 from the software deployment server 149 when the application program 143 needs to be executed.
A sensor 153 is associated with computer system 101. The sensor 153 is used to detect the environment surrounding the computer 101. For example, the sensor 153 may detect animals, automobiles, obstructions, crosswalks, etc., and further the sensor may detect the environment surrounding such animals, automobiles, obstructions, crosswalks, etc., such as: the environment surrounding the animal, e.g., other animals present around the animal, weather conditions, the brightness of the surrounding environment, etc. The sensor 153 may also be used to obtain status information of the vehicle. For example, the sensor 153 may detect state information of the vehicle such as a position of the vehicle, a speed of the vehicle, an acceleration of the vehicle, and a posture of the vehicle. Alternatively, if computer 101 is located on an autonomous vehicle, the sensor may be a camera, infrared sensor, chemical detector, microphone, or the like.
For example, the application 143 may make a decision based on surrounding environmental information detected by the sensor 153 and/or state information of the vehicle, obtain driving behavior decision information, and control the vehicle according to the driving behavior decision information. At this time, the vehicle can be controlled according to the driving behavior decision information, thereby realizing the automatic driving of the vehicle.
The driving behavior decision information may refer to one or more of actions to be performed by the vehicle, such as performing actions of acceleration, deceleration, or steering, or may refer to one or more of control modes or control systems to be selected by the vehicle, such as selecting one or more of steering control systems, direct yaw moment control systems, or emergency braking control systems.
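As an illustrative aid (not part of the patented method), the driving behavior decision information described above could be represented roughly as follows; all names in this sketch are invented for illustration:

```python
from enum import Enum, auto

class DrivingAction(Enum):
    ACCELERATE = auto()
    DECELERATE = auto()
    STEER_LEFT = auto()
    STEER_RIGHT = auto()

class ControlSystem(Enum):
    STEERING_CONTROL = auto()
    DIRECT_YAW_MOMENT_CONTROL = auto()
    EMERGENCY_BRAKING = auto()

# Decision information may carry one or more actions to perform
# and/or one or more control systems to activate.
decision_info = {
    "actions": [DrivingAction.DECELERATE],
    "control_systems": [ControlSystem.EMERGENCY_BRAKING],
}
```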
Fig. 3 is a hardware architecture diagram of a chip including a neural network processor 20 according to an embodiment of the present application. The chip may be in the processor 103 as shown in fig. 2 for making driving behavior decisions based on the state information of the vehicle. In an embodiment of the present application, the algorithms of the layers in the pre-trained neural network may be implemented in a chip as shown in fig. 3.
The method for training the driving behavior decision model and the method for determining the driving behavior in the embodiments of the present application may also be implemented in a chip as shown in fig. 3, where the chip may be the same chip as the chip implementing the pre-trained neural network, or the chip may be a different chip from the chip implementing the pre-trained neural network, which is not limited in the embodiments of the present application.
The neural network processor (NPU) 20 is mounted as a coprocessor to a host CPU (Host CPU), and the host CPU allocates tasks. The core part of the NPU is an arithmetic circuit 203; a controller 204 controls the arithmetic circuit 203 to extract matrix data from memory and perform multiplication.
In some implementations, the arithmetic circuit 203 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 203 is a two-dimensional systolic array. The arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 203 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 202 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit takes matrix a data from the input memory 201 and performs matrix operation with matrix B, and the obtained partial result or final result of the matrix is stored in an accumulator (accumulator) 208.
The vector calculation unit 207 may further process the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, vector computation unit 207 may be used for network computation of non-convolutional/non-FC layers in a neural network, such as pooling, batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 207 can store the vector of processed outputs to the unified buffer 206. For example, the vector calculation unit 207 may apply a nonlinear function to an output of the operation circuit 203, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 207 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 203, e.g., for use in subsequent layers in a neural network.
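For readers unfamiliar with this dataflow, the following numpy sketch mirrors the steps just described: tiled matrix multiplication with accumulation, followed by a nonlinear activation in the vector unit. The shapes, tile size, and choice of ReLU are illustrative assumptions, not values taken from the chip.

```python
import numpy as np

A = np.random.randn(4, 8)      # input matrix, as if read from input memory 201
B = np.random.randn(8, 6)      # weight matrix, as if read from weight memory 202
TILE = 4                       # illustrative tile width

acc = np.zeros((4, 6))         # plays the role of accumulator 208
for k in range(0, A.shape[1], TILE):
    # partial matrix products accumulate, as in the systolic array
    acc += A[:, k:k+TILE] @ B[k:k+TILE, :]

# the vector calculation unit then applies a nonlinear function,
# e.g. ReLU, to produce activation values
activated = np.maximum(acc, 0)
```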
The unified memory 206 is used for storing input data and output data.
Input data in an external memory is transferred to the input memory 201 and/or the unified memory 206 through a storage unit access controller 205 (direct memory access controller, DMAC); weight data in the external memory is transferred to the weight memory 202; and data in the unified memory 206 is transferred to the external memory.
A bus interface unit (bus interface unit, BIU) 210 is used for interaction among the host CPU, the DMAC, and the instruction fetch memory 209 via the bus.
An instruction fetch memory (instruction fetch buffer) 209, connected to the controller 204, is used to store instructions used by the controller 204.
The controller 204 is configured to invoke an instruction cached in the instruction fetch memory 209, so as to control a working process of the operation accelerator.
Typically, the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are all on-chip (On-Chip) memories, and the external memory is a memory external to the NPU, which may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM), or other readable and writable memory.
Computer system 112 may also receive information from, or transfer information to, other computer systems. Alternatively, sensor data collected from the sensor system 104 of the vehicle 100 may be transferred to another computer for processing of the data.
For example, as shown in fig. 4, data from computer system 312 may be transmitted via a network to cloud-side server 320 for further processing. The networks and intermediate nodes may include various configurations and protocols including the internet, world wide web, intranets, virtual private networks, wide area networks, local area networks, private networks using proprietary communication protocols of one or more companies, ethernet, wiFi and HTTP, and various combinations of the foregoing. Such communication may be performed by any device capable of transmitting data to and from other computers, such as modems and wireless interfaces.
In one example, server 320 may comprise a server having multiple computers, such as a load balancing server farm, that exchanges information with different nodes of a network for the purpose of receiving, processing, and transmitting data from computer system 312. The server may be configured similar to computer system 312 with processor 330, memory 340, instructions 350, and data 360.
Illustratively, the data 360 of the server 320 may include parameters of an offline-learned neural network model (e.g., a deep-learned based neural network model) and related information of the neural network model (e.g., training data of the neural network model or other parameters of the neural network model, etc.). For example, server 320 may receive, detect, store, update, and communicate parameters of an offline learned neural network model and related information for the neural network model.
For example, parameters of an offline learned neural network model may include super parameters of the neural network model, as well as other model parameters (or model policies).
For another example, the related information of the neural network model may include training data of the neural network model, other parameters of the neural network model, and the like.
It should be noted that, the server 320 may also use training data of the neural network model to train the neural network model based on the simulated learning method (i.e., offline training or offline learning), so as to update the parameters of the neural network model.
In the prior art, the driving behavior decision model can be enabled to have online learning capability based on the reinforcement learning method, namely, the driving behavior decision model can be continuously trained in the process of using the driving behavior decision model, so that the driving behavior decision model can be continuously optimized.
However, the reinforcement learning method is a typical unsupervised learning method: during training, it does not, as a supervised learning method does, use a true value (also called a label) to calculate the loss value of a model (for example, a driving behavior decision model) and use the calculated loss value to accelerate the convergence of the model, so a model meeting the user's needs cannot be obtained in a short time. Therefore, the learning efficiency of the reinforcement learning method is lower than that of the supervised learning method. Moreover, since no true value is involved in the training process, the reinforcement learning method cannot, as the supervised learning method can, ensure that the obtained model is reliable.
In summary, in the case of training the driving behavior decision model using only the reinforcement learning method, although the driving behavior decision model can be made to have online learning ability, the training efficiency of the model is often not ideal.
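The difference between the two training signals can be sketched as follows; both functions are hypothetical placeholders, meant only to contrast a label-based loss with a bootstrapped reinforcement learning target:

```python
import numpy as np

def supervised_loss(model_output, label):
    # Supervised/imitation learning: a true value (label) is available,
    # so a loss can be computed directly, accelerating convergence.
    return np.mean((model_output - label) ** 2)

def td_target(reward, gamma, next_q):
    # Reinforcement learning: no label exists; the model bootstraps
    # from its own value estimate (a temporal-difference target),
    # which typically converges more slowly.
    return reward + gamma * np.max(next_q)
```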
Based on the problems, the application provides a method for training a driving behavior decision model, which can improve the training efficiency of the driving behavior decision model. Furthermore, according to the method, the driving behavior decision model can have on-line learning capability and off-line learning capability at the same time, so that the learning efficiency of the driving behavior decision model can be improved on the premise that the driving behavior decision model has the on-line learning capability.
The method for training the driving behavior decision model and the method for determining the driving behavior in the embodiment of the present application are described in detail below with reference to fig. 5 to 10.
Fig. 5 is a schematic flow chart of a method 500 of training a driving behavior decision model provided by an embodiment of the application.
The method 500 shown in fig. 5 may include step 510, step 520, step 530, and step 540. It should be understood that the method 500 shown in fig. 5 is only exemplary and not limiting, and the method 500 may include more or fewer steps, which is not limited in the embodiment of the application; the steps are described in detail below.
The method 500 shown in fig. 5 may be performed by the processor 113 in the vehicle 100 in fig. 1, or may be performed by the processor 103 in the autopilot system in fig. 2, or may be performed by the processor 330 in the server 320 in fig. 4.
S510, using the driving behavior decision model, deciding according to the state information of the vehicle, and obtaining driving behavior decision information.
The state information of the vehicle may include a position of the vehicle, a speed of the vehicle, an acceleration of the vehicle, a posture of the vehicle, and other state information of the vehicle.
For example, the state information of the vehicle may include a preview deviation (e.g., a lateral preview deviation), a yaw rate of the vehicle, and a longitudinal speed of the vehicle.
For example, the state information of the vehicle may be the current state of the vehicle (and/or the current motion of the vehicle) in the method 600 of fig. 6 or the method 700 of fig. 7.
Wherein the driving behavior decision information may be used to indicate an action (or operation) of the vehicle to be performed, for example, performing one or more of an acceleration, deceleration or steering action.
Alternatively, the driving behavior decision information may also refer to a control mode (or control system) of the vehicle to be selected, for example, selecting one or more of a steering control system, a direct yaw moment control system, or an emergency braking control system.
Alternatively, the initial parameters of the driving behavior decision model may be determined from a third parameter of a simulated learning model trained in advance based on the simulated learning method.
For example, the simulated learning model may be the simulated learning system in the method 700 of fig. 7 or the method 800 of fig. 8.
The simulated learning method may include supervised learning (supervised learning), generative adversarial networks (generative adversarial network, GAN), inverse reinforcement learning (inverse reinforcement learning, IRL), and the like.
In the embodiment of the application, the initial parameters of the driving behavior decision model are determined according to the pre-trained third parameters simulating the learning model, so that the stability of the driving behavior decision model can be improved, and the driving behavior decision model is prevented from outputting dangerous (or unreasonable) driving behavior decision information.
For example, the third parameter may be obtained after the server (or the cloud) trains the simulated learning model in advance based on the simulated learning method, and after the training is completed, the server (or the cloud) may send the third parameter of the simulated learning model to the vehicle (for example, an autopilot system in the vehicle or a computer system in the vehicle), and further, the vehicle may determine the initial parameter of the driving behavior decision model according to the third parameter of the simulated learning model.
For another example, the third parameter of the simulated learning model may be obtained after the vehicle (for example, a processor or a computer system in the vehicle) trains the model in advance based on the simulated learning method.
It should be noted that, when determining the initial parameter of the driving behavior decision model according to the third parameter, the third parameter may be directly used as the initial parameter of the driving behavior decision model; alternatively, a part of the parameters in the third parameter may be used as a part of the parameters in the initial parameters of the driving behavior decision model (other parameters in the initial parameters of the driving behavior decision model may be determined according to other methods), which is not limited in the embodiment of the present application.
Alternatively, the third parameter may be obtained after the server (or the cloud) trains the simulated learning model based on the simulated learning method using data output by a decision expert system, where the decision expert system may be designed based on driving data of a driver (for example, the driving data may include operation data of an excellent or professional driver, operation data of the vehicle, and the like) and the dynamics characteristics of the vehicle.
For example, a rule-based decision expert system may be designed by analyzing driving data of a driver (e.g., an example of an excellent driver performing an emergency collision avoidance operation) and dynamics of a vehicle (e.g., dynamics of a vehicle tire); further, data output by the decision expert system may be collected and labeled (i.e., the data is labeled such that the simulated learning model uses the data to simulate learning), such that the simulated learning model may be trained using the labeled data based on a simulated learning method to obtain a third parameter of the simulated learning model.
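A minimal sketch of such a rule-based decision expert system is given below. The patent does not specify concrete rules, so the thresholds, state variables, and decision names here are assumptions for illustration only:

```python
def expert_decision(lateral_preview_error, longitudinal_speed, ttc):
    """Return a labeled driving behavior decision for one vehicle state."""
    if ttc is not None and ttc < 1.5:       # imminent collision: brake hard
        return "emergency_braking"
    if abs(lateral_preview_error) > 0.8:    # large tracking error: steer
        return "steering_control"
    if longitudinal_speed > 30.0:           # high-speed stabilization
        return "direct_yaw_moment_control"
    return "steering_control"

# Each (state, expert_decision(state)) pair becomes one labeled sample
# for training the simulated learning model.
```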
Optionally, the driving behavior decision model may include a first model and a second model. For example, the first model may be the current network in the method 700 of fig. 7 or the method 800 of fig. 8, and the second model may be the target network in the method 700 of fig. 7 or the method 800 of fig. 8.
Wherein the first model and the second model may each be reinforcement learning based (driving behavior) decision models, and the initial parameters of the first model and the initial parameters of the second model may be determined from third parameters of a simulation learning model trained in advance based on a simulation learning method.
Optionally, the using the driving behavior decision model to make a decision according to the state information to obtain driving behavior decision information may include:
predicting the driving behavior of the vehicle at one or more later moments according to the state information based on the dynamic model and the kinematic model of the vehicle to obtain all possible driving behaviors at the one or more moments; and evaluating all possible driving behaviors by using the driving behavior decision model to obtain the driving behavior decision information.
In the embodiment of the application, the driving behavior decision is carried out by combining the dynamic model and the kinematic model of the vehicle, so that the rationality of the driving behavior decision information can be improved.
For example, based on the dynamics model and the kinematic model of the vehicle, all possible driving behaviors of the vehicle at the i-th time after the current time may be predicted from the current state information of the vehicle, where i is a positive integer.
In the embodiment of the present application, the driving behavior of the vehicle at one or more subsequent times may be predicted at the same time, which is not limited by the embodiment of the present application.
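The prediction-and-evaluation procedure could be sketched as follows, assuming a simple kinematic bicycle model as a stand-in for the dynamics and kinematics models named in the text, and a placeholder q_network as the evaluating decision model; all values and names are illustrative:

```python
import numpy as np

def predict_next_state(x, y, yaw, v, accel, steer, L=2.7, dt=0.1):
    # one-step rollout of a kinematic bicycle model (assumed stand-in)
    x += v * np.cos(yaw) * dt
    y += v * np.sin(yaw) * dt
    yaw += v / L * np.tan(steer) * dt
    v += accel * dt
    return x, y, yaw, v

# candidate driving behaviors: (acceleration, steering angle)
candidates = [(1.0, 0.0), (-3.0, 0.0), (0.0, 0.2), (0.0, -0.2)]
state = (0.0, 0.0, 0.0, 10.0)  # x, y, yaw, longitudinal speed

def q_network(s):              # placeholder for the evaluating model
    return -abs(s[1])          # e.g., penalize lateral offset

# evaluate every predicted behavior and pick the best-scoring one
scores = [q_network(predict_next_state(*state, a, d)) for a, d in candidates]
best_action = candidates[int(np.argmax(scores))]
```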
Optionally, in the case that the driving behavior decision model includes a first model and a second model, the evaluating all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information may include: and evaluating all possible driving behaviors by using the second model to obtain the driving behavior decision information.
In the embodiment of the application, the parameter of the first model is changed frequently, and the second model is used for making a decision, so that the reliability of the driving behavior decision information can be improved.
Further, the parameters of the second model may be updated periodically according to the parameters of the first model.
For example, in the case that a first preset condition is satisfied, the parameters of the second model may be updated to the second parameter, where the first preset condition may be a preset time interval, or may be a preset number of adjustments of the parameters of the first model.
And S520, sending the driving behavior decision information to a server.
S530, receiving a first parameter which imitates a learning model and is sent by the server.
The first parameter may be obtained after the server trains the simulated learning model using the driving behavior decision information based on a simulated learning method.
Further, the first parameter may be obtained after the server trains the simulated learning model based on a simulated learning method using the driving behavior decision information satisfying a second preset condition.
Wherein the second preset condition may include at least one of the following:
condition one:
the second preset condition may include: the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
A reasonable driving behavior decision refers to a driving behavior decision that conforms to a preset rule. For example, the preset rule may be understood as the driving habits of a highly experienced driver.
The reasonable driving behavior decision can be obtained through an automatic labeling learning method or can be obtained through a manual labeling method.
For example, in a straight-line braking process, suppose the reasonable driving behavior decision corresponding to the state information of the vehicle is to activate the emergency braking control system. If, using the driving behavior decision model, the driving behavior decision information obtained by deciding according to the state information of the vehicle is also to activate the emergency braking control system, the driving behavior decision information is the same as the reasonable driving behavior decision corresponding to the state information of the vehicle; that is, the driving behavior decision information is the reasonable driving behavior decision corresponding to the state information.
In the embodiment of the application, the reasonable driving behavior decision corresponding to the state information is used, so that the learning efficiency of the driving behavior decision model can be improved.
Condition II:
the second preset condition may further include: the noise of the state information is within a first preset range.
The noise of the state information may include, among other things, interference (e.g., gaussian noise) suffered by the signal of the state information or jitter of the signal of the state information.
Alternatively, the noise of the state information may also include a data error of the state information.
For example, the state information of the vehicle includes the longitudinal speed of the vehicle. During traveling, assuming the first preset range is 5 km/h, if the error of the longitudinal speed of the vehicle is less than (or less than or equal to) 5 km/h, the driving behavior decision information satisfies the second preset condition; that is, the driving behavior decision information is a correct driving behavior decision corresponding to the state information.
The value of the first preset range in the above embodiment is merely an example and not a limitation, and may be determined according to the actual situation, which is not limited in the embodiment of the application.
In the embodiment of the application, the noise of the state information is in the first preset range, and decision is made according to the state information, so that the obtained driving behavior decision information is more reasonable, and at the moment, the parameters of the driving behavior decision model are adjusted according to the driving behavior decision information, so that the learning efficiency of the driving behavior decision model can be improved.
And (3) a third condition:
the state information may be one of a plurality of state information, and the second preset condition may further include: the plurality of status information is acquired in a plurality of scenarios.
For example, the plurality of scenes may include one or more of highway, urban, suburban, and mountain scenes.
For another example, the plurality of scenes may also include one or more of an intersection, a T-junction, and a roundabout.
It should be noted that the above-mentioned division manner of the plurality of scenes is merely an example and not a limitation, and in the embodiment of the present application, the scenes may be divided in other manners, or the embodiment of the present application may be applied to scenes where other vehicles can travel, which is not limited herein.
In the embodiment of the application, the plurality of state information is acquired in the plurality of scenes, so that the training data of the driving behavior decision model (for example, the driving behavior decision information obtained by deciding according to the state information) covers richer scenes, which helps further improve the learning efficiency of the driving behavior decision model.
Condition four:
the second preset condition may further include: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
For example, suppose the plurality of state information is acquired in four scenes (highway, urban, suburban, and mountain), with 1000 pieces (or groups) of state information acquired in the highway scene and 100 pieces (or groups) acquired in each of the urban, suburban, and mountain scenes. Then 100 pieces (or groups) of state information may be selected from the 1000 pieces (or groups) acquired in the highway scene according to the methods of condition one and condition two, so that the number of pieces of state information acquired in each of the four scenes is the same.
Alternatively, the difference between the number of pieces of state information acquired in the highway scene and the number acquired in each of the other scenes may be made to fall within the second preset range.
Alternatively, the plurality of status information may be acquired in other scenarios, which is not limited in the embodiment of the present application.
For example, the plurality of state information may be acquired in a plurality of scenes such as an intersection, a T-junction, and a roundabout, and the numbers of pieces of state information acquired in the plurality of scenes are the same, or the differences between the numbers of pieces of state information acquired in the plurality of scenes are within the second preset range.
In the embodiment of the application, the difference between the number of pieces of state information acquired in any one of the plurality of scenes and the number acquired in any other one of the plurality of scenes is within the second preset range, so that the amount of training data obtained in each scene (for example, the driving behavior decision information obtained by deciding according to the state information) is more balanced, which avoids overfitting of the driving behavior decision model to a particular scene.
The value of the second preset range in the above embodiment may be determined according to practical situations, which is not limited in the embodiment of the present application.
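Condition four could be implemented roughly as below, where per-scene sample counts are downsampled until their differences fall within the second preset range; the function and variable names are illustrative assumptions:

```python
import random
from collections import defaultdict

def balance_by_scene(samples, max_gap=0):
    """samples: iterable of (scene, state) pairs; max_gap plays the role
    of the second preset range on per-scene count differences."""
    by_scene = defaultdict(list)
    for scene, state in samples:
        by_scene[scene].append(state)
    target = min(len(v) for v in by_scene.values()) + max_gap
    return {scene: random.sample(v, min(len(v), target))
            for scene, v in by_scene.items()}

# e.g., 1000 highway samples and 100 each for urban/suburban/mountain
# are reduced so that every scene contributes a comparable number.
```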
In the embodiment of the application, the learning efficiency of the driving behavior decision model can be improved by using high-quality driving behavior decision information (for example, the driving behavior decision information meeting the second preset condition).
It should be noted that the evaluation of whether the driving behavior decision information satisfies the second preset condition may be performed by the vehicle or by the server, which is not limited in the embodiment of the application.
For example, the vehicle may send the driving behavior decision information obtained by the decision to a server, and then the server evaluates whether the driving behavior decision information meets the second preset condition, so as to screen out driving behavior decision information that meets the second preset condition.
Or, the vehicle may evaluate whether the driving behavior decision information meets the second preset condition, so as to screen out driving behavior decision information meeting the second preset condition, and then send the driving behavior decision information meeting the second preset condition to a server.
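A sketch of this vehicle-side screening follows, assuming a noise threshold on longitudinal speed (condition two) and a comparison against the reasonable decision (condition one); all helpers, field names, and thresholds are hypothetical:

```python
NOISE_LIMIT_KMH = 5.0  # assumed first preset range for speed error

def passes_second_condition(record, reference_decision):
    # condition two: state noise must lie within the first preset range
    if abs(record["speed_noise_kmh"]) > NOISE_LIMIT_KMH:
        return False
    # condition one: decision must match the reasonable (labeled) decision
    return record["decision"] == reference_decision

def upload_filtered(records, labeled_decisions, send_to_server):
    for rec, ref in zip(records, labeled_decisions):
        if passes_second_condition(rec, ref):
            send_to_server(rec)  # only qualifying decision info is uploaded
```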
S540, adjusting parameters of the driving behavior decision model according to the driving behavior decision information and the first parameters.
In the embodiment of the application, the first parameter is obtained after the server trains the simulated learning model with the driving behavior decision information based on the simulated learning method. Because the simulated learning method can guarantee the training effect of the simulated learning model, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter improves the learning efficiency of the driving behavior decision model.
Optionally, the adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter may include:
based on a reinforcement learning method, adjusting parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters; and adjusting the second parameter of the driving behavior decision model according to the first parameter.
In the embodiment of the application, the second parameter can be obtained by adjusting the parameter of the driving behavior decision model based on the reinforcement learning method, and the second parameter of the driving behavior decision model can be adjusted according to the first parameter, so that the driving behavior decision model has online learning capability and offline learning capability, and the learning efficiency of the driving behavior decision model can be further improved on the premise that the driving behavior decision model has online learning capability.
Optionally, the driving behavior decision model may include a first model and a second model.
Correspondingly, the reinforcement learning method adjusts the parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters, which may include:
Based on a reinforcement learning method, adjusting parameters of the first model according to the driving behavior decision information to obtain the second parameters; and under the condition that the first preset condition is met, updating the parameters of the second model to the second parameters.
The first preset condition may be a preset time interval or a preset number of times of parameter adjustment of the first model.
In the embodiment of the application, under the condition that the first preset condition is met, the parameters of the second model are updated to the second parameters, so that unstable output of the second model caused by frequent adjustment of the parameters of the second model can be avoided, and the reliability of the driving behavior decision information can be improved.
Wherein updating the parameters of the second model to the second parameters may refer to: directly updating all parameters of the second model into the second parameters; alternatively, it may also mean: the partial parameters of the second model (the remaining parameters of the second model may be determined according to other methods) are updated to the second parameters, which is not limited in the embodiment of the present application.
It should be noted that satisfying the first preset condition may mean that a preset time interval has elapsed since the parameters of the second model were last updated; or it may mean that the number of decisions made using the driving behavior decision model has reached a preset number; or it may mean that another preset condition is satisfied, which is not limited in the embodiment of the application.
Further, the adjusting the second parameter of the driving behavior decision model according to the first parameter may include: and adjusting parameters of the first model and/or parameters of the second model according to the first parameters.
In the embodiment of the application, the parameters of at least one of the first model and the second model can be flexibly adjusted according to the first parameters.
For example, the second parameter of the first model and the second parameter of the second model may be updated simultaneously according to the first parameter of the simulated learning model; alternatively, the second parameter of the first model may be updated according to the first parameter of the simulated learning model, and then the parameters of the second model may be updated according to the parameters of the first model when the first preset condition is satisfied.
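The following sketch, under the stated assumptions, shows one way this current/target arrangement could work: the first (current) model is adjusted on every reinforcement learning step, the second (target) model copies it only when the first preset condition is met, and the server's first parameter is blended in as an offline correction. The blend factor and sync period are invented for illustration:

```python
import numpy as np

class DecisionModel:
    def __init__(self, dim, sync_every=100):
        self.first = np.zeros(dim)    # current network, adjusted online
        self.second = np.zeros(dim)   # target network, used for decisions
        self.sync_every = sync_every  # assumed first preset condition
        self.steps = 0

    def rl_step(self, gradient, lr=1e-3):
        # online reinforcement learning adjustment -> second parameter
        self.first -= lr * gradient
        self.steps += 1
        if self.steps % self.sync_every == 0:  # first preset condition met
            self.second = self.first.copy()

    def apply_first_parameter(self, first_param, beta=0.1):
        # offline correction from the server's simulated learning model
        self.first = (1 - beta) * self.first + beta * first_param
```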
Optionally, the method 500 may further include: and controlling the vehicle according to the driving behavior decision information.
In the embodiment of the application, the driving behavior decision model is trained, and the vehicle can be controlled according to the driving behavior decision information at the same time, so that the driving behavior decision model can be trained in the process of using the driving behavior decision model, and the driving behavior decision model is continuously optimized.
The following describes in detail the implementation flow of the method for training the driving behavior decision model in the embodiment of the present application with reference to fig. 6.
Fig. 6 is a schematic flow chart of a method 600 of training a driving behavior decision model provided by an embodiment of the application.
The method 600 shown in fig. 6 may include steps 610, 620, and 630. It should be understood that the method 600 shown in fig. 6 is only exemplary and not limiting, and the method 600 may include more or fewer steps, which is not limited in this embodiment; the steps are described in detail below.
The method 600 shown in fig. 6 may be performed by the processor 330 in the server 320 in fig. 4.
S610, driving behavior decision information sent by the vehicle is received.
The driving behavior decision information may be obtained after the vehicle makes a decision according to the state information of the vehicle by using a driving behavior decision model.
For a specific description of the driving behavior decision information, the state information, and the driving behavior decision model, reference may be made to the embodiment of the method 500 of fig. 5, which is not repeated herein.
S620, training a simulated learning model according to the driving behavior decision information based on a simulated learning method, to obtain a first parameter of the simulated learning model.
The first parameter is used for adjusting the parameter of the driving behavior decision model.
Further, the training of the simulated learning model according to the driving behavior decision information based on the simulated learning method to obtain the first parameter of the simulated learning model may include:
based on the simulated learning method, training the simulated learning model according to the driving behavior decision information satisfying a second preset condition, to obtain the first parameter of the simulated learning model.
Optionally, the second preset condition may include that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
Optionally, the second preset condition may further include that noise of the state information is within a first preset range.
Alternatively, the state information may be one of a plurality of state information, and the second preset condition may further include that the plurality of state information is acquired in a plurality of scenes.
Optionally, the second preset condition may further include: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
The value of the second preset range in the above embodiment may be determined according to practical situations, which is not limited in the embodiment of the present application.
For a specific description of the second preset condition, reference may be made to the embodiment of the method 500 in fig. 5, which is not repeated here.
It should be noted that the evaluation of whether the driving behavior decision information satisfies the second preset condition may be performed by the vehicle or by the server, which is not limited in the embodiment of the present application.
For example, the vehicle may evaluate whether the driving behavior decision information satisfies the second preset condition to screen out driving behavior decision information satisfying the second preset condition, and then send the driving behavior decision information satisfying the second preset condition to a server.
S630, the first parameter is sent to the vehicle.
In the embodiment of the application, the simulated learning model is trained according to the driving behavior decision information based on the simulated learning method to obtain the first parameter of the simulated learning model. The simulated learning method can ensure the training effect of the simulated learning model, so adjusting the parameters of the driving behavior decision model according to the first parameter can improve the learning efficiency of the driving behavior decision model.
Optionally, the method 600 may further include:
training the simulated learning model by using data output by a decision expert system based on a simulated learning method to obtain a third parameter of the simulated learning model, wherein the third parameter is used for determining an initial parameter of the driving behavior decision model, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle; and sending the third parameter to the vehicle.
In the embodiment of the application, the initial parameters of the driving behavior decision model are determined according to the pre-trained third parameters simulating the learning model, so that the stability of the driving behavior decision model can be improved, and the driving behavior decision model is prevented from outputting dangerous (or unreasonable) driving behavior decision information.
Fig. 7 is a schematic flow chart of a method 700 of training a driving behavior decision model provided by an embodiment of the application.
The method 700 shown in fig. 7 may include steps 710, 720, 730 and 740. It should be understood that the method 700 shown in fig. 7 is merely exemplary and not limiting; the method 700 may include more or fewer steps, which is not limited in the embodiment of the application. The steps are described in detail below.
The various steps in method 700 may be performed by a vehicle (e.g., processor 113 in vehicle 100 in fig. 1 or processor 103 in the autopilot system in fig. 2) or a server (e.g., processor 330 in server 320 in fig. 4), respectively, as embodiments of the present application are not limited in this regard.
By way of example and not limitation, in the following embodiments of method 700, steps 710, 720, and 730 are performed by a server, and step 740 is performed by a vehicle.
S710, designing an expert system.
For example, the server may collect driving data of the vehicle, which may include driving data of the driver and dynamics data of the vehicle (e.g., the dynamics characteristics of the vehicle may be determined based on the dynamics data), and design a rule-based expert system based on this driving data; the expert system can make driving behavior decisions.
S720, constructing a training data set.
For example, as shown in fig. 7, the server may collect decision information generated by the expert system designed in S710 and label it (i.e., tag the data so that it can be used to train the neural network) to construct a training dataset.
For another example, as shown in fig. 7, the server may also collect decision information generated by the reinforcement learning system designed in S740, screen out the high-quality decision information from it, and label that high-quality decision information to construct the training dataset.
The description of the high-quality decision information and the method for determining it can be found in the embodiment of the method 500 in fig. 5, and is not repeated here.
S730, designing a simulated learning system.
The simulated learning system may be designed as a Softmax classifier based on a radial basis function neural network (RBFNN); for example, the simulated learning system may be trained offline on the training dataset constructed in S720 using a small-batch stochastic gradient descent algorithm, so that the simulated learning system clones the behavior of the expert system.
Cloning here can be understood as follows: the simulated learning system is trained offline until the performance (or effect) of the decision information it generates is no worse than, or approximates, the performance (or effect) of the decision information generated by the expert system.
S740, designing a reinforcement learning system.
The reinforcement learning system may be designed according to a reinforcement learning neural network-based scheme.
For example, a model strategy (i.e., model parameters) learned by the simulated learning system may be used as the initial strategy (i.e., initial parameters of the model) of the reinforcement learning system; the state information of the vehicle at the next moment (or the next n moments, n being a positive integer) may be predicted based on the current state of the vehicle (and/or the current action of the vehicle) by combining the dynamics model and the kinematics model of the vehicle, where the state information may cover all possible driving behaviors at a given moment; the reinforcement learning system is then used to estimate the Q values corresponding to the different driving behaviors at that moment, and the driving behavior corresponding to the maximum Q value is taken as the decision information for that moment (the driving behavior decision information output by the reinforcement learning system includes this decision information).
The reinforcement learning system may include two networks, a current network and a target network, respectively, which may employ the same RBFNN structure as the imitation learning system.
The state information predicted by combining the dynamics model of the vehicle and the kinematics model of the vehicle may include state information of the vehicle at one or more later time points.
In the case where the state information includes state information of a plurality of times, decision information of each of the plurality of times may be estimated using the reinforcement learning system, and at this time, the driving behavior decision information output by the reinforcement learning system may include the decision information of the plurality of times.
Through the above steps, the reinforcement learning system is designed; the reinforcement learning system can output the driving behavior decision information, and the vehicle can be controlled based on that driving behavior decision information.
In an embodiment of the present application, as shown in fig. 7, the steps in the method 700 may be iteratively performed continuously, thereby implementing continuous online learning of the reinforcement learning system.
For example, the steps in the method 700 may be iteratively performed as follows:
the vehicle can send the driving behavior decision information generated by the reinforcement learning system to the server;
correspondingly, the server can determine the high-quality decision information within that driving behavior decision information, add it to the training data set, and train the simulated learning system offline on the updated training data set;
the server may periodically send the model strategy (i.e., model parameters) of the simulated learning system to the vehicle;
accordingly, after the vehicle receives the model strategy (i.e., model parameters) of the simulated learning system, it may update the model strategy (i.e., model parameters) of the reinforcement learning system based on the received model strategy;
next, the vehicle may continue to send the driving behavior decision information generated by the reinforcement learning system to the server; the server may continue to train the simulated learning system offline based on that driving behavior decision information; and the server may continue to periodically send the model strategy (i.e., model parameters) of the simulated learning system to the vehicle to update the model strategy (i.e., model parameters) of the reinforcement learning system.
The various steps in the method 700 may be iteratively repeated in the manner described above.
The vehicle may update the model strategy of the reinforcement learning system based on the received model strategy either by directly replacing the reinforcement learning system's strategy with the received strategy, or by blending the two in proportion; for example, 70% of the received strategy and 30% of the reinforcement learning system's current strategy may be combined to form the new strategy of the reinforcement learning system, as the sketch below illustrates.
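By way of a non-limiting illustration, such a proportional update can be sketched as follows; the function name, the NumPy representation of the model strategy, and the parameter shapes are assumptions introduced for illustration only and are not part of the embodiment described above.

```python
import numpy as np

def blend_strategies(received, current, alpha=0.7):
    """Blend the model strategy received from the server with the vehicle's
    current reinforcement learning strategy.

    alpha = 0.7 reproduces the 70%/30% proportion given in the text;
    alpha = 1.0 corresponds to directly replacing the current strategy.
    """
    return alpha * received + (1.0 - alpha) * current

# Hypothetical usage: the vehicle updates its reinforcement learning
# weights after receiving new simulated learning weights from the server.
current_params = np.zeros((11, 4))    # e.g., RBFNN hidden-to-output weights
received_params = np.ones((11, 4))
current_params = blend_strategies(received_params, current_params)
```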
In this iterative process, the reinforcement learning system can be continuously improved through reinforcement learning, the vehicle can be better monitored through the server (or cloud), and the reinforcement learning system can be periodically regulated through the offline-trained simulated learning system, so that the performance of the autonomous vehicle can be continuously improved along two dimensions (online and offline).
Fig. 8 is a schematic flow chart of a method 800 of training a driving behavior decision model provided by an embodiment of the application.
The method 800 shown in fig. 8 may include steps 810, 820, 830 and 840. It should be understood that the method 800 shown in fig. 8 is merely exemplary and not limiting; the method 800 may include more or fewer steps, which is not limited in the embodiment of the application. The steps are described in detail below.
The various steps in method 800 may be performed by a vehicle (e.g., processor 113 in vehicle 100 in fig. 1 or processor 103 in the autopilot system in fig. 2) or a server (e.g., processor 330 in server 320 in fig. 4), respectively, as not limited in this embodiment of the application.
By way of example and not limitation, in the following embodiments of method 800, the server performs steps 810, 820 and 830, and the vehicle performs step 840.
S810, designing an expert system.
Alternatively, expert systems may be used to coordinate (decide) the motion control systems of an autonomous vehicle, which may include an emergency braking control system, a direct yaw moment control system, and a steering control system.
In the embodiment of the present application, the expert system may also be used to determine other systems or other states of the vehicle, for example, the expert system may also be used to coordinate (or determine) the speed, acceleration, or steering angle of the vehicle, etc., which is not limited in this embodiment of the present application.
As shown in fig. 8, the server may receive (or periodically receive) driving data of the vehicle transmitted by the vehicle (this may be driving data of a professional driver, for example data recorded while an excellent driver performs an emergency collision avoidance maneuver) and dynamics data of the vehicle (e.g., the dynamics characteristics of the vehicle may be determined based on the dynamics data).
The following describes in detail an example in which the expert system coordinates (decides among) the motion control systems of an autonomous vehicle.
By analyzing the driving data of the vehicle and the dynamics data of the vehicle, the rule-based expert system can be designed as follows:
a: in the linear braking process, the emergency braking control system works, but the direct yaw moment control system and the steering control system do not work;
b: in the turning avoidance process, when the lateral acceleration of the automobile is smaller than or equal to a preset threshold value, the emergency braking control system and the steering control system work, but the direct yaw moment control system does not work;
wherein the preset threshold may be 0.4 times the gravitational acceleration g, i.e., preset threshold = 0.4g.
c: in the turning avoidance process, when the lateral acceleration of the vehicle is larger than the preset threshold value, the direct yaw moment control system and the steering control system work, and the emergency braking control system does not work;
d: when the collision avoidance task is completed, the emergency braking control system, the direct yaw moment control system and the steering control system do not work.
It will be recognized by those skilled in the art that, in the above rules, the steering control system not working means that the vehicle is in a straight running state.
The pseudocode of the rule-based expert system described above may be as shown in Table 1 below.
TABLE 1: Pseudocode for the rule-based expert system (table content not preserved in this text rendering)
The kinematic state of the vehicle may include the pre-aiming deviation, path tracking deviation, course angle, and the like; the dynamic state of the vehicle may include the vehicle speed, yaw rate, lateral acceleration, longitudinal acceleration, lateral deflection angle, and the like; and the environmental perception system information may include the distances to surrounding vehicles, the speeds of surrounding vehicles, the course angles of surrounding vehicles, and the like.
At this time, decision information for coordinating (deciding) the motion control system of the automatically driven vehicle can be generated by inputting the kinematic state of the vehicle, the kinetic state of the vehicle, and the environmental perception system information (perceived by the vehicle) into the expert system.
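By way of a non-limiting illustration, rules a to d above can be sketched as a simple decision function; the 0.4g threshold and the system names are taken from the text, while the function signature, the maneuver-phase flags, and the mapping from the raw vehicle state to the maneuver phase are assumptions introduced for illustration only.

```python
G = 9.81                     # gravitational acceleration, m/s^2
AY_THRESHOLD = 0.4 * G       # preset lateral acceleration threshold (rules b/c)

def expert_decision(phase, a_y, ay_threshold=AY_THRESHOLD):
    """Return the set of motion control systems that should work.

    phase: 'straight_braking', 'turn_avoidance', or 'done' (rules a-d).
    a_y:   current lateral acceleration of the vehicle, in m/s^2.
    """
    if phase == 'straight_braking':                 # rule a
        return {'emergency_braking'}
    if phase == 'turn_avoidance':
        if a_y <= ay_threshold:                     # rule b
            return {'emergency_braking', 'steering'}
        return {'direct_yaw_moment', 'steering'}    # rule c
    return set()                                    # rule d: task completed
```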
S820, constructing a training data set.
As shown in fig. 8, the server may collect the decision information generated by the expert system and the high-quality decision information generated by the reinforcement learning system, and label the collected decision information (including both kinds) to construct a training data set.
The description of the high-quality decision information and the method for determining it can be found in the embodiment of the method 500 in fig. 5, and is not repeated here.
S830, designing a simulated learning system.
In the embodiment of the application, a simulated learning system can be designed based on a Softmax classifier and a small-batch stochastic gradient descent method so as to clone the behavior of the expert system.
For example, the simulated learning system may be designed as follows:
a: the output of the neural network is designed.
Alternatively, the neural network may be a Softmax classifier, and the decision information output by the neural network may be consistent with decision information generated by a rule-based expert system.
The decision information output by the neural network (like the decision information generated by the expert system) may be used to coordinate the operating modes of the motion control systems of the autonomous vehicle.
For example, the combinations of emergency collision avoidance actions of the autonomous vehicle can be grouped into the following categories:
sequence number "1" may indicate that only the steering control system works; sequence number "2" that the steering control system and the direct yaw moment control system work together; sequence number "3" that the steering control system and the emergency braking control system work together; sequence number "4" that the steering control system, the direct yaw moment control system and the emergency braking control system all work together; and sequence number "0" that none of the three systems works.
Accordingly, the neural network may output any of the sequence numbers described above.
b: the cost function used in training is designed.
Alternatively, the cost function may be defined using the cross-entropy method; for example, the cost function may be $L_i = -y_i \ln(P_i)$.
The cost function defined by the cross entropy method can improve learning efficiency and effect.
c: the structure and input of the neural network are determined.
The network structure of the neural network can refer to a radial basis function neural network (radial basis function neural network, RBFNN).
For example, RBFNN learning may be used to approximate the Q values fed to the Softmax classifier.
As shown in fig. 9, the RBFNN may include three inputs: the pre-aiming deviation $e_p$, the vehicle yaw rate $\gamma$, and the reciprocal of the longitudinal vehicle speed $v_x^{-1}$.
The RBFNN may include a single hidden layer $h_1 \sim h_{11}$ consisting of 11 Gaussian kernel functions, and may output a vector of 4 Q values $\hat{Q}$; the network structure of the RBFNN can be as shown in fig. 9.
The expression of the RBFNN can be:
$$\hat{Q}(x) = \theta^T h(x), \qquad h_i(x) = \exp\!\left(-\frac{\lVert x - c_i \rVert^2}{2 b_i^2}\right),$$
where $\hat{Q}$ represents the output of the neural network; $\theta$ represents the neural network weight matrix; $h(x) = [h_i]^T$ represents the basis function vector, with $i$ the index of the hidden layer node and $h_i$ a Gaussian function; $c_i$ represents the center vector of the $i$-th neural node; $b_i$ represents the width of the Gaussian function of the $i$-th neural node; and $x = [e_p, \gamma, v_x^{-1}]^T$ represents the input vector of the neural network, whose elements are the pre-aiming deviation $e_p$, the vehicle yaw rate $\gamma$, and the reciprocal of the longitudinal vehicle speed $v_x^{-1}$.
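By way of a non-limiting illustration, the forward pass defined by the expression above can be sketched as follows; the dimensions (3 inputs, 11 Gaussian hidden nodes, 4 Q-value outputs) follow the text, while the class name, the initialization of the centers and widths, and the example input values are assumptions introduced for illustration only.

```python
import numpy as np

class RBFNN:
    """Radial basis function network computing Q_hat(x) = theta^T h(x)."""

    def __init__(self, n_in=3, n_hidden=11, n_out=4, rng=None):
        rng = rng or np.random.default_rng(0)
        self.c = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))  # centers c_i
        self.b = np.ones(n_hidden)                # Gaussian widths b_i
        self.theta = np.zeros((n_hidden, n_out))  # weight matrix theta

    def hidden(self, x):
        """Basis functions h_i(x) = exp(-||x - c_i||^2 / (2 b_i^2))."""
        d2 = np.sum((self.c - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.b ** 2))

    def forward(self, x):
        """Q values for the four driving behavior classes."""
        return self.theta.T @ self.hidden(x)

# x = [pre-aiming deviation e_p, yaw rate gamma, reciprocal speed 1/v_x]
q_values = RBFNN().forward(np.array([0.1, 0.02, 1.0 / 15.0]))
```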
d: a gradient calculation formula is derived.
For example, the gradient of the total cost function of the neural network with respect to the classifier output may be
$$\frac{\partial L}{\partial Q_i} = P_i - y_i,$$
and the gradient of the total cost function with respect to the weight $W$ of the neural network may be
$$\frac{\partial L}{\partial W} = \sum_{k=1}^{N} (P_k - y_k)\, h,$$
where $P_i = e^{Q_i} / \sum_{k=1}^{N} e^{Q_k}$ is the probability value output by the Softmax classifier, $y_i$ is the tag (label) value, $Q_i$ and $Q_k$ are reinforcement learning state-action value functions, $N$ is the total number of classes of the sample, $h$ is the Gaussian kernel function vector, and $i$ and $k$ are positive integers.
The small-batch stochastic gradient descent algorithm may employ the gradient averaged over a batch:
$$\nabla_W \bar{L} = \frac{1}{M_0} \sum_{n=1}^{M_0} \nabla_W L^{(n)},$$
where $M_0$ is the batch size of the small-batch stochastic gradient descent, and $n$ is a positive integer with $1 \le n \le M_0$.
Alternatively, the neural network is trained offline using the small-batch stochastic gradient descent method, so that the behavior of the rule-based driving behavior decision system can be cloned, as the sketch below illustrates.
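By way of a non-limiting illustration, one such training step can be sketched as follows, combining the cross-entropy cost $L_i = -y_i \ln(P_i)$ with the averaged small-batch gradient above; the learning rate, the batch format, and the reuse of the hypothetical RBFNN class from step c are assumptions introduced for illustration only.

```python
import numpy as np

def softmax(q):
    """Softmax probabilities P_i = exp(Q_i) / sum_k exp(Q_k), stabilized."""
    e = np.exp(q - np.max(q))
    return e / e.sum()

def sgd_step(net, batch, lr=0.01):
    """One small-batch stochastic gradient descent step on the cross-entropy
    cost; `net` is an RBFNN as sketched above and `batch` is a list of
    (input x, one-hot label y) pairs."""
    grad = np.zeros_like(net.theta)
    for x, y in batch:
        h = net.hidden(x)
        p = softmax(net.forward(x))
        # dL/dQ_k = P_k - y_k for softmax with cross-entropy, so for a
        # linear output layer dL/dtheta = h (P - y)^T
        grad += np.outer(h, p - y)
    net.theta -= lr * grad / len(batch)   # average over the batch M_0
```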
S840, designing a reinforcement learning system.
For example, the reinforcement learning system may be designed as follows:
a: an initial policy is determined.
The model strategy (i.e., model parameters) obtained by the simulated learning system is taken as the initial strategy (i.e., initial parameters of the model) of the reinforcement learning system, so as to improve the efficiency and effect of the driving behavior decision.
For example, the Markov decision process (MDP) state may be designed as $S = [e_p, \gamma, v_x^{-1}]^T$, and the action space may be $A = [1, 2, 3, 4]^T$.
b: an immediate rewards function is determined.
The designed immediate rewards function may be r= -S T KS, wherein K is a matrix of reward weights.
c: the network structure is determined.
The reinforcement learning system may include two networks, a current network and a target network, respectively, which may employ the same RBFNN structure as the imitation learning system.
The difference is that the three inputs of the target network are the results predicted by the vehicle prediction model (e.g., by the dynamics model and the kinematics model of the vehicle).
d: and (5) designing optimization indexes and gradients.
The optimization index may be designed as the squared temporal-difference error
$$J(\theta) = \frac{1}{2}\Big[r + \gamma_{rl} \max_{a'} \hat{Q}(x', a'; \theta') - \hat{Q}(x, a; \theta)\Big]^2,$$
and the formula for the gradient may be
$$\nabla_\theta J = -\Big[r + \gamma_{rl} \max_{a'} \hat{Q}(x', a'; \theta') - \hat{Q}(x, a; \theta)\Big]\, \nabla_\theta \hat{Q}(x, a; \theta),$$
where $Q^*$ is the optimal value function, $\hat{Q}$ is its approximation function, $\gamma_{rl}$ is the discount factor, $a'$ is the action performed to maximize the Q value at the $t$-th iteration, $x'$ is the input (the state estimated for the next moment), $\theta'$ is the target network parameter, $r$ is the reward function, and $t$ is a positive integer.
e: a vehicle predictive model is determined.
The vehicle prediction model may be represented as follows:
$$\dot{x} = A x + B u + W w, \qquad y = C x,$$
where $y$ is the system output, $A$ is the state matrix, $B$ is the input matrix, $x = [\beta \;\; \gamma \;\; e_p \;\; \Delta\psi \;\; e_v]^T$ is the state vector, $u = [\delta_t \;\; M_c \;\; F_{xa}]^T$ is the input vector, and $w$ is the interference vector. Here $v_x$ is the longitudinal speed of the vehicle, $x_p$ is the pre-aiming distance, $\beta$ is the centroid slip angle, $\gamma$ is the yaw rate, $e_p$ is the pre-aiming deviation, $\Delta\psi$ is the course angle deviation, $e_v$ is the speed deviation, $\delta_t$ is the front wheel rotation angle, $M_c$ is the yaw-rate control moment, $F_{xa}$ is the longitudinal force of the vehicle, $K$ is the road curvature, $C_f$ is the cornering stiffness of the front wheels, $C_r$ is the cornering stiffness of the rear wheels, $a$ is the front wheelbase of the vehicle, $b$ is the rear wheelbase of the vehicle, $J_z$ is the moment of inertia of the vehicle, and $m$ is the mass of the vehicle. (The entries of $A$, $B$ and $W$, which are functions of these parameters, are not preserved in this text rendering.)
Thus, the vehicle prediction model is:
$$S_{t+1} = f(S_t, A_t),$$
where $S_{t+1}$ is the state at time $t+1$, $S_t$ is the state at time $t$, $A_t$ is the action at time $t$, and $T_s$ is the prediction time step; the predicted state involves the pre-aiming deviation $e_p$ and its derivative $\dot{e}_p$, the yaw rate $\gamma$ and its derivative $\dot{\gamma}$, and the longitudinal vehicle speed $v_x$ and its derivative $\dot{v}_x$.
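By way of a non-limiting illustration, a one-step prediction consistent with $S_{t+1} = f(S_t, A_t)$ can be sketched with a forward Euler discretization; the matrices $A$, $B$ and the disturbance vector are placeholders, since their entries (functions of $C_f$, $C_r$, $a$, $b$, $J_z$, $m$, $v_x$ and so on) are not preserved in this text, and the function name and time step are assumptions introduced for illustration only.

```python
import numpy as np

def predict_next_state(x, u, A, B, w, Ts):
    """One-step prediction of the vehicle state with the linear model.

    x : state vector [beta, gamma, e_p, delta_psi, e_v]
    u : input vector [front wheel angle delta_t, yaw moment M_c, force F_xa]
    w : interference (disturbance) vector, e.g. road curvature effects
    Ts: prediction time step
    Continuous model x_dot = A x + B u + w, discretized by forward Euler.
    """
    x_dot = A @ x + B @ u + w
    return x + Ts * x_dot
```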
f: and predicting the corresponding action at each moment.
For example, in combination with the vehicle prediction model, the state information of the vehicle at the next moment (or the next n moments, n being a positive integer) may be predicted based on the current state (and/or current action) of the vehicle; this state information may cover all possible driving behaviors at a given moment. The reinforcement learning system is then used to estimate the Q values corresponding to the different driving behaviors at that moment, and the driving behavior corresponding to the maximum Q value is taken as the decision information for that moment (the driving behavior decision information output by the reinforcement learning system includes this decision information), as the sketch below illustrates.
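By way of a non-limiting illustration, the maximum-Q selection in step f can be sketched as follows, reusing the hypothetical RBFNN class from the earlier sketch; the action space $A = [1, 2, 3, 4]^T$ follows the text, while the function signature is an assumption introduced for illustration only.

```python
import numpy as np

def decide(net, state, actions=(1, 2, 3, 4)):
    """Pick the driving behavior with the maximum estimated Q value.

    `net` outputs one Q value per candidate action for the given state,
    as in the RBFNN sketch above; `actions` is the action space A.
    """
    q_values = net.forward(state)   # one Q value per candidate action
    best = int(np.argmax(q_values))
    return actions[best], q_values[best]
```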
g: and calculating the gradient of weight updating of the reinforcement learning system.
For example, the gradient for the network weight update can be determined by combining the eligibility trace with the gradient descent method:
$$\Delta\theta_t = \rho_{rl}\, \delta_t\, ET_t, \qquad \delta_t = r + \gamma_{rl} \max_{a'} \hat{Q}(x', a'; \theta') - \hat{Q}(x, a; \theta), \qquad ET_t = \gamma_{rl}\, \lambda_{rl}\, ET_{t-1} + \nabla_\theta \hat{Q}(x, a; \theta),$$
where $\delta_t$ is the temporal-difference error of the value function $Q$, $\lambda_{rl}$ is the attenuation (trace decay) factor, $\gamma_{rl}$ is the discount factor, $ET_t$ is the eligibility trace at time $t$, $ET_{t-1}$ is the eligibility trace at time $t-1$, $r$ is the reward function, $\rho_{rl}$ is a positive coefficient, and $t$ is a positive integer.
h: updating parameters of the reinforcement learning system.
For example, the update formula of the weight matrix of the neural network may be determined as
$$\theta_{t+1} = \theta_t + \Delta\theta_t + \xi_{rl}\,[\theta_t - \theta_{t-1}],$$
where $\theta_{t+1}$ is the network coefficient at time $t+1$, $\theta_t$ is the network coefficient at time $t$, $\theta_{t-1}$ is the network coefficient at time $t-1$, and $\xi_{rl}$ is a proportionality coefficient (a momentum-like term).
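By way of a non-limiting illustration, steps f to h can be combined into a single update routine as sketched below; the temporal-difference error, the eligibility trace recursion, and the momentum-like term follow the reconstructed formulas above, while the coefficient values and the reuse of the hypothetical RBFNN class are assumptions introduced for illustration only.

```python
import numpy as np

def rl_update(net, target_net, x, a, r, x_next, ET, theta_prev,
              rho_rl=0.01, gamma_rl=0.95, lam_rl=0.9, xi_rl=0.1):
    """One reinforcement learning update with an eligibility trace.

    r may be the immediate reward r = -S^T K S described in step b.
    Returns the new eligibility trace and the pre-update weights, which
    become theta_{t-1} on the next call.
    """
    # Temporal-difference error delta_t, with the target network
    # providing max_a' Q_hat(x', a'; theta')
    delta = (r + gamma_rl * np.max(target_net.forward(x_next))
             - net.forward(x)[a])

    # Eligibility trace: ET_t = gamma_rl * lambda_rl * ET_{t-1} + grad Q
    grad_q = np.zeros_like(net.theta)
    grad_q[:, a] = net.hidden(x)       # dQ(x, a)/dtheta for linear output
    ET = gamma_rl * lam_rl * ET + grad_q

    # theta_{t+1} = theta_t + rho_rl * delta_t * ET_t
    #               + xi_rl * (theta_t - theta_{t-1})
    theta_t = net.theta.copy()
    net.theta = theta_t + rho_rl * delta * ET + xi_rl * (theta_t - theta_prev)
    return ET, theta_t
```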
In the embodiment of the application, the high-quality data generated by the reinforcement learning system can be labeled and then added to the training data set, to be provided to the simulated learning system for offline training.
In the method 800, S820, S830 and S840 may be performed iteratively and continuously; through offline training and online learning, the reinforcement learning system continuously trains itself by interacting with the autonomous vehicle and thereby improves the automatic driving system.
Fig. 10 is a schematic block diagram of an apparatus 1000 for training a driving behavior decision model according to an embodiment of the present application. It should be understood that the apparatus 1000 for training a driving behavior decision model shown in fig. 10 is only an example, and the apparatus 1000 of the embodiment of the present application may further include other modules or units. It should be appreciated that the apparatus 1000 for training a driving behavior decision model is capable of performing the various steps in the methods of fig. 5, 7 or 8, and will not be described in detail herein in order to avoid repetition.
The decision unit 1010 is configured to use a driving behavior decision model to make a decision according to state information of a vehicle, so as to obtain driving behavior decision information;
A transmitting unit 1020 for transmitting the driving behavior decision information to a server;
a receiving unit 1030, configured to receive a first parameter of a model for imitation learning sent by the server, where the first parameter is obtained after the server trains the model for imitation learning using the driving behavior decision information based on an imitation learning method;
and the adjusting unit 1040 is configured to adjust parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
Optionally, the adjusting unit 1040 is specifically configured to:
based on a reinforcement learning method, adjusting parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters; and adjusting the second parameter of the driving behavior decision model according to the first parameter.
Optionally, the driving behavior decision model includes a first model and a second model; wherein, the adjusting unit 1040 is specifically configured to:
based on a reinforcement learning method, adjusting parameters of the first model according to the driving behavior decision information to obtain the second parameters; and under the condition that a first preset condition is met, updating the parameters of the second model into the second parameters, wherein the first preset condition is a preset time interval or a preset number of times of parameter adjustment of the first model.
Optionally, the adjusting unit 1040 is specifically configured to: and adjusting parameters of the first model and/or parameters of the second model according to the first parameters.
Optionally, the decision unit 1010 is specifically configured to:
predicting the driving behavior of the vehicle at one or more later moments according to the state information based on the dynamic model and the kinematic model of the vehicle to obtain all possible driving behaviors at the one or more moments; and evaluating all possible driving behaviors by using the driving behavior decision model to obtain the driving behavior decision information.
Optionally, in the case that the driving behavior decision model includes a first model and a second model, the decision unit 1010 is specifically configured to:
and evaluating all possible driving behaviors by using the second model to obtain the driving behavior decision information.
Optionally, the receiving unit 1030 is further configured to:
receiving a third parameter of the simulated learning model sent by the server, wherein the third parameter is obtained after training the simulated learning model by using data output by a decision expert system based on a simulated learning method, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle;
The adjusting unit 1040 is further configured to:
and determining initial parameters of the driving behavior decision model according to the third parameters.
Optionally, the first parameter is obtained after the server trains the simulated learning model based on a simulated learning method by using the driving behavior decision information meeting a second preset condition, and the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
Optionally, the second preset condition further includes that noise of the state information is within a first preset range.
Optionally, the state information is one of a plurality of state information, and the second preset condition further includes that the plurality of state information is acquired in a plurality of scenes.
Optionally, the second preset condition further includes: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
Fig. 11 is a schematic block diagram of an apparatus 1100 for training a driving behavior decision model provided by an embodiment of the present application. It should be understood that the apparatus 1100 for training a driving behavior decision model shown in fig. 11 is only an example, and the apparatus 1100 of the embodiment of the present application may further include other modules or units. It should also be appreciated that the apparatus 1100 is capable of performing the various steps in the methods of fig. 6, 7 or 8, which are not described in detail here in order to avoid repetition.
A receiving unit 1110, configured to receive driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle makes a decision according to state information of the vehicle using a driving behavior decision model;
the training unit 1120 is configured to train a simulated learning model according to the driving behavior decision information based on a simulated learning method, so as to obtain a first parameter of the simulated learning model, where the first parameter is used to adjust a parameter of the driving behavior decision model;
a transmitting unit 1130, configured to transmit the first parameter to the vehicle.
Optionally, the training unit 1120 is further configured to:
training the simulated learning model by using data output by a decision expert system based on a simulated learning method to obtain a third parameter of the simulated learning model, wherein the third parameter is used for determining an initial parameter of the driving behavior decision model, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle;
the transmitting unit 1130 is further configured to: and sending the third parameter to the vehicle.
Optionally, the training unit 1120 is specifically configured to:
based on a simulated learning method, train a simulated learning model according to the driving behavior decision information meeting a second preset condition to obtain a first parameter of the simulated learning model, wherein the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
Optionally, the second preset condition further includes that noise of the state information is within a first preset range.
Optionally, the state information is one of a plurality of state information, and the second preset condition further includes that the plurality of state information is acquired in a plurality of scenes.
Optionally, the second preset condition further includes: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
Fig. 12 is a schematic hardware structure of an apparatus for training a driving behavior decision model according to an embodiment of the present application. The apparatus 3000 for training a driving behavior decision model shown in fig. 12 (the apparatus 3000 may be a computer device specifically) includes a memory 3001, a processor 3002, a communication interface 3003, and a bus 3004. The memory 3001, the processor 3002, and the communication interface 3003 are connected to each other by a bus 3004.
The memory 3001 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 3001 may store a program that, when executed by the processor 3002, is operable by the processor 3002 to perform the steps of the method of training a driving behavior decision model according to embodiments of the present application.
The processor 3002 may employ a general-purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to implement the methods of training a driving behavior decision model of the method embodiments of the present application.
The processor 3002 may also be an integrated circuit chip with signal processing capabilities, for example, the chip shown in fig. 3. In implementation, the steps of the method of training the driving behavior decision model of the present application may be accomplished by instructions in the form of integrated logic circuits of hardware or software in the processor 3002.
The processor 3002 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which may implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied as being directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 3001; the processor 3002 reads the information in the memory 3001 and, in combination with its hardware, performs the functions that the units included in the apparatus for training a driving behavior decision model need to perform, or performs the method for training a driving behavior decision model according to the method embodiment of the present application.
The communication interface 3003 enables communications between the apparatus 3000 and other devices or communication networks using a transceiving apparatus such as, but not limited to, a transceiver. For example, the state information of the vehicle, the running data of the vehicle, and training data required in training the driving behavior decision model may be acquired through the communication interface 3003.
A bus 3004 may include a path to transfer information between various components of the device 3000 (e.g., memory 3001, processor 3002, communication interface 3003).
It is to be appreciated that the processor in embodiments of the application may be a central processing unit (CPU), but may also be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM) which acts as an external cache. By way of example but not limitation, many forms of random access memory (random access memory, RAM) are available, such as Static RAM (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced Synchronous Dynamic Random Access Memory (ESDRAM), synchronous Link DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any other combination. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website site, computer, server, or data center to another website site, computer, server, or data center by wired (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains one or more sets of available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid state disk.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: there are three cases, a alone, a and B together, and B alone, wherein a, B may be singular or plural. In addition, the character "/" herein generally indicates that the associated object is an "or" relationship, but may also indicate an "and/or" relationship, and may be understood by referring to the context.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (40)

1. A method of training a driving behavior decision model, comprising:
using a driving behavior decision model to make a decision according to the state information of the vehicle to obtain driving behavior decision information, wherein the driving behavior decision information comprises a to-be-executed action of the vehicle, a to-be-selected control mode or a control system of the vehicle;
sending the driving behavior decision information to a server;
receiving a first parameter of a simulated learning model sent by the server, wherein the first parameter is obtained after the server trains the simulated learning model by using the driving behavior decision information based on a simulated learning method;
and adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameters.
2. The method of claim 1, wherein adjusting parameters of the driving behavior decision model based on the driving behavior decision information and the first parameter comprises:
based on a reinforcement learning method, adjusting parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters;
and adjusting the second parameter of the driving behavior decision model according to the first parameter.
3. The method of claim 2, wherein the driving behavior decision model comprises a first model and a second model;
the reinforcement learning method is used for adjusting the parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters, and comprises the following steps:
based on a reinforcement learning method, adjusting parameters of the first model according to the driving behavior decision information to obtain the second parameters;
and under the condition that a first preset condition is met, updating the parameters of the second model into the second parameters, wherein the first preset condition is a preset time interval or a preset number of times of parameter adjustment of the first model.
4. A method according to claim 3, wherein said adjusting said second parameter of said driving behaviour decision model in accordance with said first parameter comprises:
and adjusting parameters of the first model and/or parameters of the second model according to the first parameters.
5. The method according to any one of claims 1 to 4, wherein the making a decision based on the state information using a driving behavior decision model, resulting in driving behavior decision information, comprises:
predicting the driving behavior of the vehicle at one or more later moments according to the state information based on the dynamic model and the kinematic model of the vehicle to obtain all possible driving behaviors at the one or more moments;
and evaluating all possible driving behaviors by using the driving behavior decision model to obtain the driving behavior decision information.
6. The method according to claim 5, wherein, in the case where the driving behavior decision model includes a first model and a second model, the evaluating all possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information includes:
And evaluating all possible driving behaviors by using the second model to obtain the driving behavior decision information.
7. The method according to any one of claims 1 to 4, further comprising:
receiving a third parameter of the simulated learning model sent by the server, wherein the third parameter is obtained after training the simulated learning model by using data output by a decision expert system based on a simulated learning method, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle;
and determining initial parameters of the driving behavior decision model according to the third parameters.
8. The method according to any one of claims 1 to 4, characterized in that the first parameter is obtained after the server trains the model for imitation learning based on an imitation learning method using the driving behavior decision information satisfying a second preset condition including that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
9. The method of claim 8, wherein the second preset condition further comprises noise of the status information being within a first preset range.
10. The method of claim 8, wherein the status information is one of a plurality of status information, and wherein the second preset condition further comprises the plurality of status information being acquired in a plurality of scenes.
11. The method of claim 10, wherein the second preset condition further comprises: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
12. A method of training a driving behavior decision model, comprising:
receiving driving behavior decision information sent by a vehicle, wherein the driving behavior decision information is obtained after the vehicle makes a decision according to the state information of the vehicle by using a driving behavior decision model, and comprises a to-be-executed action of the vehicle, a to-be-selected control mode or a control system of the vehicle;
training a simulated learning model according to the driving behavior decision information based on a simulated learning method to obtain first parameters of the simulated learning model, wherein the first parameters are used for adjusting the parameters of the driving behavior decision model;
And sending the first parameter to the vehicle.
13. The method according to claim 12, wherein the method further comprises:
training the simulated learning model by using data output by a decision expert system based on a simulated learning method to obtain a third parameter of the simulated learning model, wherein the third parameter is used for determining an initial parameter of the driving behavior decision model, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle;
and sending the third parameter to the vehicle.
14. The method according to claim 12 or 13, wherein the training a model learning model based on the driving behavior decision information based on a model learning method, to obtain a first parameter of the model learning model, comprises:
based on a simulated learning method, training a simulated learning model according to the driving behavior decision information meeting a second preset condition to obtain a first parameter of the simulated learning model, wherein the second preset condition comprises that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
15. The method of claim 14, wherein the second preset condition further comprises noise of the status information being within a first preset range.
16. The method of claim 14, wherein the status information is one of a plurality of status information, and wherein the second preset condition further comprises the plurality of status information being acquired in a plurality of scenes.
17. The method of claim 16, wherein the second preset condition further comprises: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
18. An apparatus for training a driving behavior decision model, comprising:
the decision unit is used for deciding according to the state information of the vehicle by using the driving behavior decision model to obtain driving behavior decision information, wherein the driving behavior decision information comprises actions to be executed of the vehicle, a control mode to be selected of the vehicle or a control system;
a transmitting unit configured to transmit the driving behavior decision information to a server;
the receiving unit is used for receiving first parameters of the simulated learning model sent by the server, wherein the first parameters are obtained after the server trains the simulated learning model by using the driving behavior decision information based on a simulated learning method;
And the adjusting unit is used for adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameters.
19. The device according to claim 18, wherein the adjustment unit is specifically configured to:
based on a reinforcement learning method, adjusting parameters of the driving behavior decision model according to the driving behavior decision information to obtain second parameters;
and adjusting the second parameter of the driving behavior decision model according to the first parameter.
20. The apparatus of claim 19, wherein the driving behavior decision model comprises a first model and a second model;
wherein, the adjustment unit is specifically used for:
based on a reinforcement learning method, adjusting parameters of the first model according to the driving behavior decision information to obtain the second parameters;
and under the condition that a first preset condition is met, updating the parameters of the second model into the second parameters, wherein the first preset condition is a preset time interval or a preset number of times of parameter adjustment of the first model.
21. The device according to claim 20, wherein the adjustment unit is specifically configured to:
And adjusting parameters of the first model and/or parameters of the second model according to the first parameters.
22. The apparatus according to any one of claims 18 to 21, wherein the decision unit is specifically configured to:
predicting the driving behavior of the vehicle at one or more later moments according to the state information based on the dynamic model and the kinematic model of the vehicle to obtain all possible driving behaviors at the one or more moments;
and evaluating all possible driving behaviors by using the driving behavior decision model to obtain the driving behavior decision information.
23. The apparatus according to claim 22, wherein in case the driving behaviour decision model comprises a first model and a second model, the decision unit is specifically adapted to:
and evaluating all possible driving behaviors by using the second model to obtain the driving behavior decision information.
24. The apparatus according to any one of claims 18 to 21, wherein the receiving unit is further configured to:
receiving a third parameter of the simulated learning model sent by the server, wherein the third parameter is obtained after training the simulated learning model by using data output by a decision expert system based on a simulated learning method, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of a vehicle;
The adjusting unit is further configured to:
and determining initial parameters of the driving behavior decision model according to the third parameters.
25. The apparatus according to any one of claims 18 to 21, wherein the first parameter is obtained after the server trains the model for imitation learning based on an imitation learning method using the driving behavior decision information satisfying a second preset condition including that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
26. The apparatus of claim 25, wherein the second preset condition further comprises noise of the status information being within a first preset range.
27. The apparatus of claim 25, wherein the status information is one of a plurality of status information, and wherein the second preset condition further comprises the plurality of status information being acquired in a plurality of scenarios.
28. The apparatus of claim 27, wherein the second preset condition further comprises: the difference between the number of the state information acquired in any one of the plurality of scenes and the number of the state information acquired in any other one of the plurality of scenes is within a second preset range.
29. An apparatus for training a driving behavior decision model, comprising:
a receiving unit, configured to receive driving behavior decision information sent by a vehicle, wherein the driving behavior decision information is obtained after the vehicle makes a decision according to state information of the vehicle by using a driving behavior decision model, and the driving behavior decision information comprises an action to be performed by the vehicle, or a control mode or a control system to be selected for the vehicle;
a training unit, configured to train an imitation learning model according to the driving behavior decision information based on an imitation learning method, to obtain a first parameter of the imitation learning model, wherein the first parameter is used for adjusting a parameter of the driving behavior decision model;
and a transmitting unit, configured to transmit the first parameter to the vehicle.
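For illustration only: the three units of claim 29 sketched as one server-side round. The transport object conn, its receive/send methods, and the trainer callback are assumptions, not the patent's interfaces.

def server_round(conn, imitation_model, train_imitation):
    decision_info = conn.receive()                           # receiving unit
    first_params = train_imitation(imitation_model,          # training unit
                                   decision_info)
    conn.send(first_params)                                  # transmitting unit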
30. The apparatus of claim 29, wherein the training unit is further configured to:
train, based on an imitation learning method, the imitation learning model by using data output by a decision expert system, to obtain a third parameter of the imitation learning model, wherein the third parameter is used for determining an initial parameter of the driving behavior decision model, and the decision expert system is designed according to driving data of a driver and dynamics characteristics of the vehicle;
and the transmitting unit is further configured to:
send the third parameter to the vehicle.
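For illustration only: claim 30's pretraining sketched as behavioral cloning on (state, action) pairs emitted by the decision expert system; the returned state dict plays the role of the third parameter. The regression loss, hyperparameters, and batch format are assumptions.

import torch
import torch.nn.functional as F

def pretrain_on_expert(imitation_model, expert_batches, epochs=10, lr=1e-3):
    opt = torch.optim.Adam(imitation_model.parameters(), lr=lr)
    for _ in range(epochs):
        for states, expert_actions in expert_batches:
            # Regress the model's action onto the expert system's action.
            loss = F.mse_loss(imitation_model(states), expert_actions)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return imitation_model.state_dict()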
31. The apparatus of claim 29 or 30, wherein the training unit is specifically configured to:
train, based on an imitation learning method, the imitation learning model according to the driving behavior decision information that satisfies a second preset condition, to obtain the first parameter of the imitation learning model, wherein the second preset condition comprises that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
32. The apparatus of claim 31, wherein the second preset condition further comprises that noise of the state information is within a first preset range.
33. The apparatus of claim 31, wherein the state information is one of a plurality of pieces of state information, and the second preset condition further comprises that the plurality of pieces of state information are acquired in a plurality of scenarios.
34. The apparatus of claim 33, wherein the second preset condition further comprises: a difference between the number of pieces of state information acquired in any one of the plurality of scenarios and the number of pieces of state information acquired in any other one of the plurality of scenarios is within a second preset range.
35. An apparatus for training a driving behavior decision model, comprising a processor and a memory, wherein the memory is configured to store program instructions and the processor is configured to invoke the program instructions to perform the method of any one of claims 1 to 11.
36. An apparatus for training a driving behavior decision model, comprising a processor and a memory, wherein the memory is configured to store program instructions and the processor is configured to invoke the program instructions to perform the method of any one of claims 12 to 17.
37. An automobile comprising the apparatus of any one of claims 18 to 28 or claim 35.
38. A server comprising the apparatus of any one of claims 29 to 34 or claim 36.
39. A computer-readable storage medium having stored therein program instructions which, when executed by a processor, implement the method of any one of claims 1 to 17.
40. A chip comprising a processor and a data interface, the processor reading instructions stored on a memory via the data interface to perform the method of any one of claims 1 to 17.
CN202010508722.3A 2020-06-06 2020-06-06 Method and device for training driving behavior decision model Active CN113835421B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010508722.3A CN113835421B (en) 2020-06-06 2020-06-06 Method and device for training driving behavior decision model
PCT/CN2021/091964 WO2021244207A1 (en) 2020-06-06 2021-05-06 Method and apparatus for training driving behavior decision-making model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010508722.3A CN113835421B (en) 2020-06-06 2020-06-06 Method and device for training driving behavior decision model

Publications (2)

Publication Number Publication Date
CN113835421A CN113835421A (en) 2021-12-24
CN113835421B true CN113835421B (en) 2023-12-15

Family

ID=78830645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010508722.3A Active CN113835421B (en) 2020-06-06 2020-06-06 Method and device for training driving behavior decision model

Country Status (2)

Country Link
CN (1) CN113835421B (en)
WO (1) WO2021244207A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11742901B2 (en) * 2020-07-27 2023-08-29 Electronics And Telecommunications Research Institute Deep learning based beamforming method and apparatus
CN114078242A * 2022-01-19 2022-02-22 Zhejiang Geely Holding Group Co., Ltd. Perception decision model upgrading method and system based on automatic driving
CN114407931B * 2022-02-21 2024-05-03 Southeast University Safe driving decision-making method for high-level automated operating vehicles
CN116070783B * 2023-03-07 2023-05-30 Beihang University Learning-based energy management method for a hybrid powertrain on commuting segments
CN116302010B * 2023-05-22 2023-07-28 Anhui Zhongke Xingchi Automatic Driving Technology Co., Ltd. Automatic driving system upgrade package generation method and device, computer equipment and medium
CN116700012B * 2023-07-19 2024-03-01 Hefei University of Technology Design method of a multi-agent collision-avoidance formation-containment controller

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102295004A * 2011-06-09 2011-12-28 National University of Defense Technology Lane departure warning method
CN108550279A * 2018-04-03 2018-09-18 Tongji University Vehicle driving behavior prediction method based on machine learning
CN110060475A * 2019-04-17 2019-07-26 Tsinghua University Multi-intersection signal light cooperative control method based on deep reinforcement learning
CN110187639A * 2019-06-27 2019-08-30 Jilin University Trajectory planning control method based on a parametric decision-making framework
CN110322017A * 2019-08-13 2019-10-11 Jilin University Trajectory tracking control strategy for autonomous intelligent vehicles based on deep reinforcement learning
CN110758403A * 2019-10-30 2020-02-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Control method, device, equipment and storage medium for automatic driving vehicle

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8392116B2 (en) * 2010-03-24 2013-03-05 Sap Ag Navigation device and method for predicting the destination of a trip
DE202016004628U1 (en) * 2016-07-27 2016-09-23 Google Inc. Traversing an environment state structure using neural networks
CN106874597B * 2017-02-16 2019-12-13 BIT Huidong (Changshu) Vehicle Technology Co., Ltd. Highway overtaking behavior decision method applied to automatic driving vehicle
WO2018175441A1 (en) * 2017-03-20 2018-09-27 Mobileye Vision Technologies Ltd. Navigation by augmented path prediction
JP6856936B2 (en) * 2017-12-04 2021-04-14 アセントロボティクス株式会社 Learning methods, learning devices and learning programs
US11106211B2 (en) * 2018-04-02 2021-08-31 Sony Group Corporation Vision-based sample-efficient reinforcement learning framework for autonomous driving
US11327156B2 (en) * 2018-04-26 2022-05-10 Metawave Corporation Reinforcement learning engine for a radar system
US20200033869A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers that implement autonomous driver agents and a policy server for serving policies to autonomous driver agents for controlling an autonomous vehicle
CN110858328B * 2018-08-06 2022-06-14 Ninebot (Beijing) Technology Co., Ltd. Data acquisition method and device for imitation learning, and storage medium


Also Published As

Publication number Publication date
CN113835421A (en) 2021-12-24
WO2021244207A1 (en) 2021-12-09

Similar Documents

Publication Publication Date Title
CN109901574B (en) Automatic driving method and device
US20210262808A1 (en) Obstacle avoidance method and apparatus
CN113835421B (en) Method and device for training driving behavior decision model
CN110379193B (en) Behavior planning method and behavior planning device for automatic driving vehicle
CN109901572B (en) Automatic driving method, training method and related device
CN113879295B (en) Track prediction method and device
CN112230642B (en) Road travelable area reasoning method and device
CN110371132B (en) Driver takeover evaluation method and device
US20220080972A1 (en) Autonomous lane change method and apparatus, and storage medium
CN113632033B (en) Vehicle control method and device
CN113156927A (en) Safety control method and safety control device for automatic driving vehicle
CN113498529B (en) Target tracking method and device
CN111950726A (en) Decision method based on multi-task learning, decision model training method and device
WO2022062825A1 (en) Vehicle control method, device, and vehicle
US20230048680A1 (en) Method and apparatus for passing through barrier gate crossbar by vehicle
CN113954858A (en) Method for planning vehicle driving route and intelligent automobile
CN113552867B (en) Planning method for motion trail and wheeled mobile device
WO2022017307A1 (en) Autonomous driving scenario generation method, apparatus and system
CN115039095A (en) Target tracking method and target tracking device
US20230107033A1 (en) Method for optimizing decision-making regulation and control, method for controlling traveling of vehicle, and related apparatus
CN113799794B (en) Method and device for planning longitudinal movement parameters of vehicle
CN114074672B (en) Method for identifying cornering stiffness of a tyre of a vehicle and related device
US20230256970A1 (en) Lane Change Track Planning Method and Apparatus
CN112639910B (en) Method and device for observing traffic elements
CN114556251B (en) Method and device for determining a passable space for a vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant