WO2022205211A1 - Method and apparatus for controlling vehicle running and vehicle

Info

Publication number: WO2022205211A1
Application number: PCT/CN2021/084731
Authority: WO
Inventors: 苏琪; 聂为然; 许明霞
Original assignee: 华为技术有限公司
Priority date: 2021-03-31
Filing date: 2021-03-31
Publication date: 2022-10-06
Also published as: CN113226886A

Abstract

A method and apparatus for controlling vehicle running and a vehicle. The method comprises: in an automatic running mode of a vehicle, obtaining a user instruction; obtaining environment information around the vehicle; performing multi-modal understanding on the user instruction and the environment information around the vehicle, and determining a driving intention of the user; and generating an automatic running control instruction for the vehicle according to the driving intention of the user.

Description

Method, device and vehicle for controlling vehicle running

Technical field

The present application relates to the field of automatic driving, and more particularly, to a method, device and vehicle for controlling the driving of a vehicle.

Background

Artificial intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.

Autonomous driving is a mainstream application in the field of artificial intelligence. Autonomous driving technology relies on the cooperation of computer vision, radar, monitoring devices and global positioning systems to allow motor vehicles to drive themselves without active human operation. Autonomous vehicles use various computing systems to help transport users from one location to another. Some autonomous vehicles may require some initial or continuous input from a user, such as a pilot, driver, or passenger. An autonomous vehicle permits the operator to switch from a manual operating mode to an autonomous driving mode or a mode in between. Since autonomous driving technology does not require humans to drive motor vehicles, it can in theory effectively avoid human driving errors, reduce the occurrence of traffic accidents, and improve the efficiency of highway transportation. Therefore, autonomous driving technology is receiving more and more attention.

At present, an autonomous vehicle drives on the basis of a preset destination and the surrounding environment of the vehicle obtained by various sensors, and delivers the user to the destination along a planned route. However, during actual driving, the user may form temporary intentions, based on what is visible around the vehicle, that differ from simply driving to the destination. For example, if the vehicle is close to the car in front, the user may want it to keep a greater distance. Under existing autonomous driving technology, a user who forms such a temporary intention can only take over control of the vehicle through manual intervention and then carry out the intention personally. Since the vehicle has then been switched to manual driving mode, the user can no longer enjoy the more worry-free and safer driving experience brought by autonomous driving technology. In addition, when the level of driving automation is Level 5 (L5) (as defined by the Society of Automotive Engineers (SAE)), the human intervention function of the vehicle may be removed altogether. In that case, the user cannot carry out the temporary intention at all, which degrades the user experience.

Therefore, how to improve the user experience in the process of autonomous driving is an urgent problem to be solved.

Summary of the invention

The present application provides a method, device and vehicle for controlling the driving of a vehicle, which can improve the user experience during automatic driving.

In a first aspect, a method for controlling the driving of a vehicle is provided. The method can be executed by an electronic device that supports controlling the driving of the vehicle. An electronic device refers to a device that can be abstracted as a computer system. In this application, the electronic device that supports controlling the driving of the vehicle may also be referred to as the device for controlling the driving of the vehicle. The device for controlling the driving of the vehicle may be the electronic device as a whole, or may be a part of the electronic device, for example a chip related to the function of controlling the driving of the vehicle, such as a system chip. The system chip is also called a system on chip (SOC), or SOC chip. Specifically, the device for controlling the driving of the vehicle may be a terminal device or an in-vehicle device, such as an in-vehicle computer, an in-vehicle head unit, or a mobile phone in the vehicle, or may be a processor, a system on chip, or another type of in-vehicle chip.

The method includes: in the automatic driving mode of the vehicle, acquiring a user instruction; acquiring environmental information around the vehicle; performing multi-modal understanding on the user instruction and the environmental information around the vehicle to determine the user's driving intention; and generating an automatic driving control instruction for the vehicle according to the user's driving intention.

In the embodiment of the present application, in the automatic driving mode of the vehicle, the user's driving intention can be determined by acquiring a user instruction and environmental information around the vehicle and performing multi-modal understanding on them, and an automatic driving control instruction for the vehicle is generated according to the user's driving intention. In this way, the user's temporary driving intention can be executed while the vehicle remains in the automatic driving mode, and the user does not need to manually take over control to execute it, which improves the user experience during automatic driving.
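The flow of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the names `understand`, `generate_control_instruction` and `handle_user_instruction`, the `DrivingIntention` record, and the keyword matching that stands in for multi-modal understanding are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DrivingIntention:
    intent: str                    # e.g. "stop", "overtake", "follow"
    target: Optional[str] = None   # object in the environment that the instruction refers to


def understand(instruction: str, environment: List[str]) -> Optional[DrivingIntention]:
    """Toy stand-in for multi-modal understanding (hypothetical).

    Combines the instruction text with labels of objects detected around
    the vehicle. A real system would use a trained model instead.
    """
    text = instruction.lower()
    for intent in ("stop", "overtake", "follow"):
        if intent in text:
            # Ground the instruction in the environment: pick the first
            # detected object that the instruction mentions by name.
            target = next((obj for obj in environment if obj in text), None)
            return DrivingIntention(intent, target)
    return None


def generate_control_instruction(intention: DrivingIntention) -> dict:
    """Map a driving intention to a (hypothetical) control instruction."""
    return {"action": intention.intent, "target": intention.target}


def handle_user_instruction(autonomous_mode: bool, instruction: str,
                            environment: List[str]) -> Optional[dict]:
    # The method applies only while the vehicle is in automatic driving mode.
    if not autonomous_mode:
        return None
    intention = understand(instruction, environment)
    if intention is None:
        return None
    return generate_control_instruction(intention)
```

In this sketch the environment is reduced to a list of object labels; the application itself leaves the representation of environmental information open.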

In conjunction with the first aspect, in some implementations of the first aspect, the driving intention includes at least one intent, each of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value and a classification of the slot value, where n is an integer greater than or equal to 0.

In conjunction with the first aspect, in some implementations of the first aspect, the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.

With reference to the first aspect, in some implementations of the first aspect, the slot name includes at least one of: a parking position, a speed value, an overtaking or following object, and a steering orientation.

With reference to the first aspect, in some implementations of the first aspect, the classification of the slot value is: an enumeration type slot value, a text type slot value or an environment type slot value, wherein the enumeration type slot value indicates that the slot value is a predefined enumeration value, the text type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment type slot value indicates that the slot value is an identification made in the environment information according to the content mentioned in the user instruction.

Optionally, the environment type slot value includes an image type slot value, where the image reflects the environment around the vehicle. An image type slot value therefore indicates that the slot value is an identification made in the image information according to the content mentioned in the user instruction.
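The intent and slot structure described above might be represented as in the following sketch. The class names, the field names, and the use of a bounding-box tuple as an environment type slot value are assumptions made for illustration; the application does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class SlotValueKind(Enum):
    ENUMERATION = "enumeration"  # value drawn from a predefined set
    TEXT = "text"                # substring of, or text generated from, the user instruction
    ENVIRONMENT = "environment"  # something identified in the environment information


@dataclass
class Slot:
    name: str             # e.g. "parking position", "speed value"
    value: object         # representation is an assumption (string, number, bounding box, ...)
    kind: SlotValueKind   # classification of the slot value


@dataclass
class Intent:
    name: str                                        # e.g. "stop", "overtake"
    slots: List[Slot] = field(default_factory=list)  # n >= 0 slots


@dataclass
class DrivingIntention:
    intents: List[Intent]

    def __post_init__(self):
        # The driving intention includes at least one intent.
        if not self.intents:
            raise ValueError("a driving intention needs at least one intent")


# Hypothetical result for an instruction such as "stop next to that gate":
# the parking position is identified in the image as a bounding box.
example = DrivingIntention([
    Intent("stop", [Slot("parking position", (120, 80, 200, 160),
                         SlotValueKind.ENVIRONMENT)]),
])
```

An intent with no slots (n = 0), such as a bare "decelerate", is simply `Intent("decelerate")`.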

With reference to the first aspect, in some implementations of the first aspect, generating an automatic driving control instruction for the vehicle according to the user's driving intention includes: judging whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; and if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.

Optionally, if the driving intention is not feasible, prompt information may be generated and sent to the user.

Optionally, the prompt information may include the reason why the driving intention is not feasible.

In the embodiment of the present application, after the driving intention is determined, it is judged whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; if the driving intention is feasible, the automatic driving control instruction for the vehicle is then generated. In this way, violations of traffic regulations and other problems can be avoided when executing the user's driving intention in the automatic driving mode, which safeguards both the user experience and the safety of automatic driving.
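The feasibility check and the optional prompt information can be sketched as follows. The `Environment` fields and the specific rules (solid lane marking, occupied adjacent lane, no-stopping zone) are hypothetical examples chosen for illustration; the application only states that the judgment takes the driving intention, the surrounding environment and traffic regulations into account.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Environment:
    # Hypothetical summary of the surroundings and the applicable traffic rules.
    adjacent_lane_clear: bool
    lane_marking_solid: bool   # solid line: lane change not permitted
    stopping_allowed: bool


def check_feasibility(intent: str, env: Environment) -> Tuple[bool, Optional[str]]:
    """Return (feasible, reason). The reason is set only when infeasible."""
    if intent == "overtake":
        if env.lane_marking_solid:
            return False, "lane change is not permitted across a solid line"
        if not env.adjacent_lane_clear:
            return False, "the adjacent lane is occupied"
    elif intent == "stop":
        if not env.stopping_allowed:
            return False, "stopping is not permitted here"
    return True, None


def respond(intent: str, env: Environment) -> dict:
    feasible, reason = check_feasibility(intent, env)
    if feasible:
        # Only a feasible intention produces a control instruction.
        return {"type": "control", "action": intent}
    # Otherwise produce prompt information that includes the reason.
    return {"type": "prompt", "message": f"Cannot {intent}: {reason}"}
```

Returning the reason alongside the verdict keeps the check and the user-facing prompt in one place, which matches the optional behavior of telling the user why an intention was rejected.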

With reference to the first aspect, in some implementations of the first aspect, the user instruction includes any one or more of a user voice instruction, a user text instruction, and a user air gesture instruction.

Optionally, if the user instruction actually obtained is a user voice instruction or a user air gesture instruction, then in actual operation the voice instruction or air gesture instruction can first be converted into a user text instruction, and multimodal understanding is then performed on the text instruction and the surrounding environment information. Alternatively, multimodal understanding can be performed directly on the user's voice instruction or air gesture instruction together with the surrounding environment information, which is not limited in this application.

With reference to the first aspect, in some implementations of the first aspect, the method further includes: sending a photographing activation signal to a photographing device to activate the photographing device to photograph environmental information around the vehicle; and acquiring the environmental information around the vehicle includes: acquiring the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.

It should be understood that the environmental information photographed by the photographing device may also be recorded as image information. However, it should be understood that in actual operation, the acquired environmental information may be not only image information captured by a photographing device, but also environmental information acquired by lidar, vehicle-mounted sensors, and/or Internet of Vehicles, etc., which is not limited in this application.

With reference to the first aspect, in some implementations of the first aspect, acquiring environmental information around the vehicle includes: acquiring environmental information around the vehicle periodically photographed by the photographing device.
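The two acquisition modes described above (capture on an activation signal, and periodic capture) can be contrasted in a short sketch. `EnvironmentSource`, its method names and the frame buffer are hypothetical; the photographing device is modeled as a plain callable that returns one frame.

```python
from collections import deque
from typing import Callable, Deque, Optional


class EnvironmentSource:
    """Hypothetical wrapper around a photographing device."""

    def __init__(self, capture: Callable[[], str], buffer_size: int = 5):
        self._capture = capture
        self._frames: Deque[str] = deque(maxlen=buffer_size)

    def on_period_elapsed(self) -> None:
        # Periodic mode: called on a timer; only the most recent frames are kept.
        self._frames.append(self._capture())

    def latest(self) -> Optional[str]:
        # Periodic mode: read the most recently captured frame, if any.
        return self._frames[-1] if self._frames else None

    def capture_on_demand(self) -> str:
        # Triggered mode: the activation signal causes a fresh capture.
        return self._capture()
```

In the triggered mode the device only works when a user instruction arrives, whereas the periodic mode makes a recent frame available immediately at the cost of capturing continuously.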

With reference to the first aspect, in some implementations of the first aspect, the user's driving intention is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.

In the embodiment of the present application, the user's driving intention may be presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen, so that the user can promptly judge whether the multimodal understanding result is correct.

In a second aspect, a device for controlling the driving of a vehicle is provided. The device includes an acquisition unit and a processing unit. In the automatic driving mode of the vehicle, the acquisition unit is used to acquire a user instruction; the acquisition unit is further used to acquire environmental information around the vehicle; the processing unit is used to perform multimodal understanding on the user instruction and the environmental information around the vehicle to determine the user's driving intention; and the processing unit is further used to generate an automatic driving control instruction for the vehicle according to the user's driving intention.

In conjunction with the second aspect, in some implementations of the second aspect, the driving intention includes at least one intent, each of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value and a classification of the slot value, where n is an integer greater than or equal to 0.

In conjunction with the second aspect, in some implementations of the second aspect, the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.

With reference to the second aspect, in some implementations of the second aspect, the slot name includes at least one of: a parking position, a speed value, an overtaking or following object, and a steering orientation.

With reference to the second aspect, in some implementations of the second aspect, the classification of the slot value is: an enumeration type slot value, a text type slot value or an environment type slot value, wherein the enumeration type slot value indicates that the slot value is a predefined enumeration value, the text type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment type slot value indicates that the slot value is an identification made in the environment information according to the content mentioned in the user instruction.

With reference to the second aspect, in some implementations of the second aspect, the processing unit is further configured to: determine whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; and if the driving intention is feasible, generate an automatic driving control instruction for the vehicle.

With reference to the second aspect, in some implementations of the second aspect, the user instructions include: any one or more of user voice instructions, user text instructions, and user air gesture instructions.

With reference to the second aspect, in some implementations of the second aspect, the device further includes: a sending unit, where the sending unit is configured to send a photographing activation signal to the photographing device, so as to activate the photographing device to photograph environmental information around the vehicle; and the acquiring unit is further configured to: acquire the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.

With reference to the second aspect, in some implementations of the second aspect, the acquiring unit is further configured to: acquire environmental information around the vehicle periodically photographed by the photographing device.

In combination with the second aspect, in some implementations of the second aspect, the user's driving intention is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.

In a third aspect, a training method for a multimodal processing module is provided, including: acquiring training data, where the training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data; and training the multimodal processing module according to the training input data and the training target data.
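The shape of the training data can be illustrated with the sketch below. It shows only how training inputs (user instruction plus environmental information) pair with training targets (driving intention); the `MemorizingModule` is a deliberately trivial lookup table standing in for a real learned model, and every name and encoding here is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class TrainingSample:
    instruction: str               # training input: user instruction
    environment: Tuple[str, ...]   # training input: labels of nearby objects (assumed encoding)
    intention: str                 # training target: driving intention label


class MemorizingModule:
    """Toy stand-in for a multimodal processing module (hypothetical).

    It memorizes input/target pairs instead of learning to generalize.
    """

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, Tuple[str, ...]], str] = {}

    def fit(self, samples: List[TrainingSample]) -> None:
        # "Training": associate each training input with its training target.
        for s in samples:
            self._table[(s.instruction, s.environment)] = s.intention

    def predict(self, instruction: str, environment: Sequence[str]) -> Optional[str]:
        # Inference: return the stored intention for a seen input, else None.
        return self._table.get((instruction, tuple(environment)))
```

A real multimodal processing module would replace the lookup table with a model that generalizes to unseen instructions and scenes.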

In conjunction with the third aspect, in some implementations of the third aspect, the driving intention includes at least one intent, each of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value and a classification of the slot value, where n is an integer greater than or equal to 0.

In conjunction with the third aspect, in some implementations of the third aspect, the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.

With reference to the third aspect, in some implementations of the third aspect, the slot name includes at least one of: a parking position, a speed value, an overtaking or following object, and a steering orientation.

In combination with the third aspect, in some implementations of the third aspect, the classification of the slot value is: an enumeration type slot value, a text type slot value or an environment type slot value, wherein the enumeration type slot value indicates that the slot value is a predefined enumeration value, the text type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment type slot value indicates that the slot value is an identification made in the environment information according to the content mentioned in the user instruction.

A fourth aspect provides a training device for a multimodal processing module, including an acquisition unit and a processing unit, where the acquisition unit is used to acquire training data, the training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data; and the processing unit is used to train the multimodal processing module according to the training input data and the training target data.

In conjunction with the fourth aspect, in some implementations of the fourth aspect, the driving intention includes at least one intent, each of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value and a classification of the slot value, where n is an integer greater than or equal to 0.

In conjunction with the fourth aspect, in some implementations of the fourth aspect, the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.

With reference to the fourth aspect, in some implementations of the fourth aspect, the slot name includes at least one of: a parking location, a speed value, an overtaking or following object, and a steering orientation.

With reference to the fourth aspect, in some implementations of the fourth aspect, the classification of the slot value is: an enumeration type slot value, a text type slot value or an environment type slot value, wherein the enumeration type slot value indicates that the slot value is a predefined enumeration value, the text type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment type slot value indicates that the slot value is an identification made in the environment information according to the content mentioned in the user instruction.

In a fifth aspect, another method for controlling the driving of a vehicle is provided, comprising: in an automatic driving mode of the vehicle, acquiring a user instruction; acquiring environmental information around the vehicle; determining the user's driving intention according to the user instruction and the environmental information; generating an automatic driving control instruction for the vehicle at least according to the user's driving intention; and controlling the vehicle to drive based on the automatic driving control instruction.

In the embodiment of the present application, in the automatic driving mode of the vehicle, the user's driving intention can be determined by acquiring the user instruction and the environmental information around the vehicle, and an automatic driving control instruction for the vehicle is generated at least according to the user's driving intention. In this way, the user's temporary driving intention can be executed while the vehicle remains in the automatic driving mode, and the user does not need to manually take over control to execute it, which improves the user experience during automatic driving.

With reference to the fifth aspect, in some implementations of the fifth aspect, determining the user's driving intention according to the user instruction and the environment information includes: performing multimodal understanding on the user instruction and the environment information; and determining the user's driving intention according to the result of the multimodal understanding.

With reference to the fifth aspect, in some implementations of the fifth aspect, the user instruction includes at least one of a user voice instruction, a user text instruction, and a user air gesture instruction.

In conjunction with the fifth aspect, in some implementations of the fifth aspect, the driving intention includes at least one intent, each of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value and a classification of the slot value, where n is an integer greater than or equal to 0.

In conjunction with the fifth aspect, in some implementations of the fifth aspect, the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.

With reference to the fifth aspect, in some implementations of the fifth aspect, the slot name includes at least one of: a parking position, a speed value, an overtaking or following object, and a steering orientation.

With reference to the fifth aspect, in some implementations of the fifth aspect, the classification of the slot value is: an enumeration type slot value, a text type slot value or an environment type slot value.

With reference to the fifth aspect, in some implementations of the fifth aspect, generating an automatic driving control instruction for the vehicle according to the user's driving intention includes: judging whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; and if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.

In the embodiment of the present application, after the driving intention is determined, it is judged whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; if the driving intention is feasible, the automatic driving control instruction for the vehicle is then generated. Therefore, violations of traffic regulations and other problems can be avoided when executing the user's driving intention in the automatic driving mode, which safeguards both the user experience and the safety of automatic driving.

With reference to the fifth aspect, in some implementations of the fifth aspect, if the user instruction to be acquired is a user text instruction, then before the user text instruction is acquired, a user natural voice instruction or a user air gesture instruction may be acquired first, and the natural voice instruction or air gesture instruction is then converted into the text instruction.

With reference to the fifth aspect, in some implementations of the fifth aspect, the method further includes: sending a photographing activation signal to a photographing device to activate the photographing device to photograph environmental information around the vehicle; and acquiring the environmental information around the vehicle includes: acquiring the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.

With reference to the fifth aspect, in some implementations of the fifth aspect, acquiring the environmental information around the vehicle includes: acquiring the environmental information around the vehicle periodically photographed by the photographing device.

With reference to the fifth aspect, in some implementations of the fifth aspect, the user's driving intention is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.

In a sixth aspect, another apparatus for controlling the running of a vehicle is provided, including various modules capable of implementing the method for controlling the running of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

A seventh aspect provides a processing method for a multimodal processing module, where the multimodal processing module is obtained by training according to the training method in the third aspect or any possible implementation manner of the third aspect; the processing method includes: the multimodal processing module obtains input data, where the input data includes user instructions and environmental information around the vehicle; and the multimodal processing module outputs the driving intention according to the input data.

In an eighth aspect, a multimodal processing module is provided, where the multimodal processing module is obtained by training according to the training method in the third aspect or any possible implementation manner of the third aspect. The multimodal processing module includes: an acquisition unit for acquiring input data, where the input data includes user instructions and environmental information around the vehicle; and a processing unit for outputting the driving intention according to the input data.

In a ninth aspect, an autonomous driving vehicle is provided, including the device in the second aspect or any possible implementation of the second aspect; and/or including the device in the fourth aspect or any possible implementation of the fourth aspect; and/or including the device in the sixth aspect or any possible implementation manner of the sixth aspect; and/or including the module in the eighth aspect or any possible implementation manner of the eighth aspect.

A tenth aspect provides a device for controlling the driving of a vehicle, characterized by comprising a processor and a memory, wherein the memory is used for storing program instructions, and the processor is used for calling the program instructions to execute the method for controlling the driving of a vehicle in the first aspect or any possible implementation manner of the first aspect; and/or to execute the other method for controlling the driving of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

In an eleventh aspect, a training device for a multimodal processing module is provided, characterized in that it includes a processor and a memory, where the memory is used for storing program instructions, and the processor is used for calling the program instructions to execute the training method for the multimodal processing module in the third aspect or any possible implementation manner of the third aspect.

A twelfth aspect provides a system, where the system includes the apparatus in the second aspect or any possible implementation manner of the second aspect; and/or includes the apparatus in the sixth aspect or any possible implementation manner of the sixth aspect.

Optionally, the system may be a vehicle, or may be an on-board system on a vehicle, which is not limited in this application.

A thirteenth aspect provides a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to execute the method for controlling the driving of a vehicle in the first aspect or any possible implementation manner of the first aspect; and/or to execute the other method for controlling the driving of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

A fourteenth aspect provides a computer program product containing instructions which, when the computer program product runs on a computer, cause the computer to execute the training method for the multimodal processing module in the third aspect or any possible implementation manner of the third aspect.

A fifteenth aspect provides a computer-readable storage medium, where the computer-readable medium stores program code for execution by a device, the program code including instructions for executing the method for controlling the driving of a vehicle in the first aspect or any possible implementation manner of the first aspect; and/or for executing the other method for controlling the driving of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

A sixteenth aspect provides a computer-readable storage medium, where the computer-readable medium stores program code for execution by a device, the program code including instructions for executing the training method for the multimodal processing module in the third aspect or any possible implementation manner of the third aspect.

A seventeenth aspect provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the method for controlling the driving of a vehicle in the first aspect or any possible implementation manner of the first aspect; and/or executes the other method for controlling the driving of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

Optionally, as an implementation manner, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the method for controlling the driving of a vehicle in the first aspect or any possible implementation manner of the first aspect; and/or to execute the other method for controlling the driving of a vehicle in the fifth aspect or any possible implementation manner of the fifth aspect.

In an eighteenth aspect, a chip is provided, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface and executes the training method for the multimodal processing module in the third aspect or any possible implementation manner of the third aspect.

Optionally, as an implementation manner, the chip may further include a memory in which instructions are stored, and the processor is configured to execute the instructions stored in the memory. When the instructions are executed, the processor is configured to execute the training method for the multimodal processing module in the third aspect or any possible implementation manner of the third aspect.

Description of drawings

FIG. 1 is a functional block diagram of a vehicle provided by an embodiment of the present application;

FIG. 2 is an exemplary diagram of an automatic driving system to which an embodiment of the present application is applicable;

FIG. 3 is an example diagram of an application of a cloud-side command to an autonomous driving vehicle according to an embodiment of the present application;

FIG. 4 is an example diagram of a method for controlling the driving of a vehicle provided by an embodiment of the present application;

FIG. 5 is an example diagram of a system architecture provided by an embodiment of the present application;

FIG. 6 is an example diagram of a specific implementation provided by an embodiment of the present application;

FIG. 7 is an exemplary diagram of another specific implementation manner provided by an embodiment of the present application;

FIG. 8 is an exemplary diagram of a multimodal processing method provided by an embodiment of the present application;

FIG. 9 is an exemplary diagram of another multimodal processing method provided by an embodiment of the present application;

FIG. 10 is an example diagram of a training method for a multimodal processing module provided by an embodiment of the present application;

FIG. 11 is an example diagram of an application scenario provided by an embodiment of the present application;

FIG. 12 is an example diagram of a device for controlling the driving of a vehicle provided by an embodiment of the present application;

FIG. 13 is an example diagram of a training device for a multimodal processing module provided by an embodiment of the present application;

FIG. 14 is a schematic structural diagram of an apparatus provided by an embodiment of the present application;

FIG. 15 is an example diagram of a computer program product provided by an embodiment of the present application.

Detailed description

The technical solutions in the present application will be described below with reference to the accompanying drawings.

FIG. 1 is a functional block diagram of a vehicle provided by an embodiment of the present application. In one embodiment, the vehicle 100 is configured in a fully or partially autonomous driving mode.

For example, while in the autonomous driving mode, the vehicle 100 can control itself: it can determine the current state of the vehicle and its surroundings through human operation, determine the possible behavior of at least one other vehicle in the surrounding environment, determine a confidence level corresponding to the likelihood that the other vehicle performs the possible behavior, and control the vehicle 100 based on the determined information. When the vehicle 100 is in the autonomous driving mode, the vehicle 100 may be set to operate without human interaction.

Vehicle 100 may include various subsystems, such as travel system 102, sensor system 104, control system 106, one or more peripherals 108, power supply 110, computer system 112, and user interface 116. Optionally, vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements. Additionally, the subsystems and elements of the vehicle 100 may be interconnected by wire or wirelessly.

The travel system 102 may include components that provide powered motion for the vehicle 100. In one embodiment, travel system 102 may include engine 118, energy source 119, transmission 120, and wheels/tires 121. The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or a combination of engine types, such as a hybrid of a gasoline engine and an electric motor, or a hybrid of an internal combustion engine and an air compression engine. Engine 118 converts energy from energy source 119 into mechanical energy.

Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electricity. The energy source 119 may also provide energy to other systems of the vehicle 100.

Transmission 120 may transmit mechanical power from engine 118 to wheels 121. Transmission 120 may include a gearbox, a differential, and a drive shaft. In one embodiment, transmission 120 may also include other devices, such as clutches. The drive shaft may include one or more axles that may be coupled to one or more wheels 121.

The sensor system 104 may include several sensors that sense information about the environment surrounding the vehicle 100. For example, the sensor system 104 may include a positioning system 122 (which may be a global positioning system (GPS), a Beidou system, or another positioning system), an inertial measurement unit (IMU) 124, radar 126, laser rangefinder 128, and camera 130. The sensor system 104 may also include sensors that monitor the internal systems of the vehicle 100 (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge). Sensor data from one or more of these sensors can be used to detect objects and their corresponding characteristics (position, shape, orientation, velocity, etc.). Such detection and identification is a critical function for the safe operation of the autonomous vehicle 100.

The positioning system 122 may be used to estimate the geographic location of the vehicle 100. The IMU 124 is used to sense position and orientation changes of the vehicle 100 based on inertial acceleration. In one embodiment, IMU 124 may be a combination of an accelerometer and a gyroscope.

Radar 126 may utilize radio signals to sense objects within the surrounding environment of vehicle 100. In some embodiments, in addition to sensing objects, radar 126 may be used to sense the speed and/or heading of objects.

The laser rangefinder 128 may utilize laser light to sense objects in the environment in which the vehicle 100 is located. In some embodiments, the laser rangefinder 128 may include one or more laser sources, laser scanners, and one or more detectors, among other system components.

Camera 130 may be used to capture multiple images of the surrounding environment of vehicle 100. Camera 130 may be a still camera or a video camera.

The control system 106 controls the operation of the vehicle 100 and its components. Control system 106 may include various elements, including steering system 132, throttle 134, braking unit 136, sensor fusion algorithm 138, computer vision system 140, route control system 142, and obstacle avoidance system 144.

The steering system 132 is operable to adjust the heading of the vehicle 100. For example, in one embodiment it may be a steering wheel system.

The throttle 134 is used to control the operating speed of the engine 118 and thus the speed of the vehicle 100.

The braking unit 136 is used to control the deceleration of the vehicle 100. The braking unit 136 may use friction to slow the wheels 121. In other embodiments, the braking unit 136 may convert the kinetic energy of the wheels 121 into electrical current. The braking unit 136 may also take other forms to slow the wheels 121 and thereby control the speed of the vehicle 100.

Computer vision system 140 may be operable to process and analyze images captured by camera 130 in order to identify objects and/or features in the environment surrounding vehicle 100. The objects and/or features may include traffic signals, road boundaries, and obstacles. Computer vision system 140 may use object recognition algorithms, structure from motion (SFM) algorithms, video tracking, and other computer vision techniques. In some embodiments, the computer vision system 140 may be used to map the environment, track objects, estimate the speed of objects, and the like.

The route control system 142 is used to determine the travel route of the vehicle 100. In some embodiments, the route control system 142 may combine data from the sensor fusion algorithm 138, the positioning system 122, and one or more predetermined maps to determine a driving route for the vehicle 100.

The obstacle avoidance system 144 is used to identify, evaluate, and avoid or otherwise negotiate potential obstacles in the environment of the vehicle 100.

Of course, in one example, the control system 106 may additionally or alternatively include components other than those shown and described. Alternatively, some of the components shown above may be omitted.

Vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through peripheral devices 108. Peripherals 108 may include a wireless communication system 146, an onboard computer 148, a microphone 150, and/or a speaker 152.

In some embodiments, peripherals 108 provide a means for a user of vehicle 100 to interact with user interface 116. For example, the onboard computer 148 may provide information to the user of the vehicle 100. User interface 116 may also operate the onboard computer 148 to receive user input. The onboard computer 148 can be operated via a touch screen. In other cases, peripheral devices 108 may provide a means for vehicle 100 to communicate with other devices located within the vehicle. For example, microphone 150 may receive audio (e.g., voice commands or other audio input) from a user of vehicle 100. Similarly, speaker 152 may output audio to a user of vehicle 100.

Wireless communication system 146 may wirelessly communicate with one or more devices, either directly or via a communication network. For example, wireless communication system 146 may use 3G cellular communications, such as code division multiple access (CDMA), global system for mobile communications (GSM), or general packet radio service (GPRS); 4G cellular communications, such as long term evolution (LTE); or 5G cellular communications. The wireless communication system 146 may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system 146 may communicate directly with a device using an infrared link, Bluetooth, or the like. Other wireless protocols may also be used, such as various vehicle communication systems; for example, wireless communication system 146 may include one or more dedicated short range communications (DSRC) devices, which may carry public and/or private data communications between vehicles and/or roadside stations.

The power supply 110 may provide power to various components of the vehicle 100. In one embodiment, the power supply 110 may be a rechargeable lithium-ion or lead-acid battery. One or more battery packs of such batteries may be configured as a power supply to provide power to various components of the vehicle 100. In some embodiments, power supply 110 and energy source 119 may be implemented together, such as in some all-electric vehicles.

Some or all of the functions of the vehicle 100 are controlled by the computer system 112. Computer system 112 may include at least one processor 113 that executes instructions 115 stored in a non-transitory computer-readable medium such as memory 114. Computer system 112 may also be multiple computing devices that control individual components or subsystems of vehicle 100 in a distributed fashion.

The processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or other hardware-based processor. Although FIG. 1 functionally illustrates the processor, memory, and other elements of the computer system 112 in the same block, one of ordinary skill in the art will understand that the processor, computer, or memory may actually include multiple processors, computers, or memories that may or may not be housed within the same physical enclosure. For example, the memory may be a hard drive or other storage medium located in an enclosure different from that of the computer system 112. Thus, a reference to a processor or computer will be understood to include a reference to a collection of processors, computers, or memories that may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering and deceleration components, may each have their own processor that performs only computations related to component-specific functions.

In various aspects described herein, a processor may be located remotely from the vehicle and in wireless communication with the vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the vehicle while others are performed by a remote processor, including taking steps necessary to perform a single maneuver.

In some embodiments, the memory 114 may contain instructions 115 (e.g., program logic) executable by the processor 113 to perform various functions of the vehicle 100, including those described above. Memory 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of travel system 102, sensor system 104, control system 106, and peripherals 108.

In addition to instructions 115, memory 114 may store data such as road maps, route information, vehicle location, direction, speed, and other such vehicle data, among other information. Such information may be used by the vehicle 100 and the computer system 112 during operation of the vehicle 100 in autonomous, semi-autonomous and/or manual modes.

The user interface 116 is used to provide information to or receive information from a user of the vehicle 100. Optionally, the user interface 116 may include one or more input/output devices within the set of peripheral devices 108, such as the wireless communication system 146, the onboard computer 148, the microphone 150, and the speaker 152.

Computer system 112 may control functions of vehicle 100 based on input received from various subsystems (e.g., travel system 102, sensor system 104, and control system 106) and from user interface 116. For example, computer system 112 may utilize input from control system 106 in order to control steering system 132 to avoid obstacles detected by sensor system 104 and obstacle avoidance system 144. In some embodiments, computer system 112 is operable to provide control of various aspects of vehicle 100 and its subsystems.

Optionally, one or more of the components described above may be installed separately from, or merely associated with, the vehicle 100. For example, memory 114 may exist partially or completely separate from vehicle 100. The above-described components may be communicatively coupled together in a wired and/or wireless manner.

Optionally, the above components are merely examples. In practical applications, components in each of the above modules may be added or deleted according to actual needs, and FIG. 1 should not be construed as a limitation on the embodiments of the present application.

A self-driving car traveling on a road, such as the vehicle 100 above, can recognize objects in its surroundings to determine an adjustment to its current speed. The objects may be other vehicles, traffic control equipment, or other types of objects. In some examples, each identified object may be considered independently, and the object's respective characteristics, such as its current speed, acceleration, and distance from the vehicle, may be used to determine the speed to which the autonomous vehicle is to adjust.

Optionally, the autonomous vehicle 100 or a computing device associated with the autonomous vehicle 100 (e.g., computer system 112, computer vision system 140, or memory 114 of FIG. 1) may predict the behavior of an identified object based on the characteristics of the identified object and the state of the surrounding environment (e.g., traffic, rain, ice on the road). Optionally, the behaviors of the identified objects depend on one another, so all identified objects may also be considered together to predict the behavior of a single identified object. The vehicle 100 can adjust its speed based on the predicted behavior of the identified object. In other words, the self-driving car can determine what steady state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the object. In this process, other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 in the road being traveled, the curvature of the road, and the proximity of static and dynamic objects.
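As a non-limiting illustration of the speed adjustment described above, the following minimal sketch maps the predicted state of identified objects to one of the steady states (accelerate, decelerate, or stop). The names `TrackedObject` and `choose_speed_action`, the thresholds, and the rule that the most restrictive object decides are all assumptions made for this example; the application does not specify a particular decision rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TrackedObject:
    """An identified object ahead of the vehicle (hypothetical fields)."""
    gap_m: float       # longitudinal distance ahead of the vehicle, in metres
    speed_mps: float   # predicted speed along the lane, in metres per second


def choose_speed_action(ego_speed_mps: float,
                        objects: List[TrackedObject],
                        cruise_speed_mps: float = 15.0,
                        stop_gap_m: float = 5.0,
                        min_headway_s: float = 2.0) -> str:
    """Return 'stop', 'decelerate', 'accelerate' or 'keep' (assumed rule set)."""
    # With nothing in the way, head toward the assumed cruise speed.
    decision = "accelerate" if ego_speed_mps < cruise_speed_mps else "keep"
    for obj in objects:
        # An object inside the stop gap overrides every other consideration.
        if obj.gap_m <= stop_gap_m:
            return "stop"
        closing = ego_speed_mps > obj.speed_mps
        headway_s = obj.gap_m / ego_speed_mps if ego_speed_mps > 0 else float("inf")
        # Slow down when the gap is shrinking and the time headway is too short.
        if closing and headway_s < min_headway_s:
            decision = "decelerate"
    return decision
```

For instance, under these assumed thresholds a vehicle travelling at 10 m/s that is 15 m behind an object moving at 5 m/s has a headway of 1.5 s and would decelerate. A practical system would also weigh the other factors mentioned above, such as road curvature and lateral position.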

In addition to providing instructions to adjust the speed of the self-driving car, the computing device may also provide instructions to modify the steering angle of the vehicle 100 so that the self-driving car follows a given trajectory and/or maintains safe lateral and longitudinal distances from objects in its vicinity (e.g., cars in adjacent lanes on the road).

Optionally, the autonomous vehicle 100 or a computing device associated with the autonomous vehicle 100 (such as the computer system 112, the computer vision system 140, or the memory 114 in FIG. 1) may also predict, based on the state of the vehicle and the detected environmental information, whether autonomous driving is available on the road ahead, and control the switching between the autonomous and manual driving modes.

The above-mentioned vehicle 100 may be a car, a truck, a motorcycle, a bus, a boat, an airplane, a helicopter, a lawn mower, a recreational vehicle, a playground vehicle, construction equipment, a tram, a golf cart, a train, a cart, or the like, which is not particularly limited in the embodiments of the present application.

FIG. 2 is an example diagram of an automatic driving system provided by an embodiment of the present application.

The automatic driving system shown in FIG. 2 includes a computer system 101, wherein the computer system 101 includes a processor 103, and the processor 103 is coupled with a system bus 105. The processor 103 may be one or more processors, each of which may include one or more processor cores. A video adapter 107, which can drive a display 109, is coupled to the system bus 105. The system bus 105 is coupled to an input/output (I/O) bus 113 through a bus bridge 111. I/O interface 115 is coupled to the I/O bus. I/O interface 115 communicates with various I/O devices, such as input device 117 (e.g., keyboard, mouse, touch screen), media tray 121 (e.g., compact disc read-only memory (CD-ROM), multimedia interface), transceiver 123 (which can transmit and/or receive radio communication signals), camera 155 (which can capture static and dynamic digital video images), and external universal serial bus (USB) interface 125. Optionally, the interface connected to the I/O interface 115 may be a USB interface.

The processor 103 may be any conventional processor, including a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, or a combination of the above. Alternatively, the processor may be a dedicated device such as an application specific integrated circuit (ASIC). Optionally, the processor 103 may be a neural network processor or a combination of a neural network processor and the above-mentioned conventional processors.

Optionally, in various embodiments described herein, computer system 101 may be located remotely from the autonomous vehicle and may communicate wirelessly with the autonomous vehicle. In other aspects, some of the processes described herein are performed on a processor disposed within the autonomous vehicle, while others are performed by a remote processor, including taking the actions required to perform a single maneuver.

Computer system 101 may communicate with software deployment server 149 through network interface 129. Network interface 129 is a hardware network interface, such as a network card. The network 127 may be an external network, such as the Internet, or an internal network, such as an Ethernet network or a virtual private network (VPN). Optionally, the network 127 may also be a wireless network, such as a WiFi network or a cellular network.

A hard disk drive interface is coupled to the system bus 105 and is connected to a hard disk drive. System memory 135 is coupled to system bus 105. Data running in system memory 135 may include operating system 137 and application programs 143 of computer system 101.

The operating system 137 includes a shell 139 and a kernel 141. The shell 139 is an interface between the user and the kernel of the operating system. The shell is the outermost layer of the operating system. The shell manages the interaction between the user and the operating system: waiting for user input, interpreting user input for the operating system, and processing the various outputs of the operating system.

Kernel 141 consists of those parts of the operating system that manage memory, files, peripherals, and system resources. Interacting directly with hardware, the operating system kernel usually runs processes and provides inter-process communication, CPU time slice management, interrupt handling, memory management, I/O management, and the like.

Application programs 143 include programs that control the autonomous driving of the car, for example, programs that manage the interaction of the autonomous car with obstacles on the road, programs that control the route or speed of the autonomous car, and programs that control the interaction of the autonomous car with other autonomous vehicles on the road. Application programs 143 also exist on the system of the deployment server 149. In one embodiment, computer system 101 may download an application program 143 from the deployment server 149 when the application program 143 needs to be executed.

For example, the application program 143 may be a program that controls the autonomous vehicle to activate or deactivate the assisted automatic driving function.

Sensor 153 is associated with computer system 101 and is used to detect the environment around the computer system 101. For example, the sensor 153 can detect animals, cars, obstacles, pedestrian crossings, and the like. Further, the sensor can also detect the environment around such animals, cars, obstacles, and pedestrian crossings, for example, the environment around an animal, such as other animals appearing around it, weather conditions, and ambient light levels. Optionally, if the computer system 101 is located on a self-driving car, the sensor may be a camera, an infrared sensor, a chemical detector, a microphone, or the like.

Computer system 112 in FIG. 1 may also receive information from or transfer information to other computer systems. Alternatively, sensor data collected from the sensor system 104 of the vehicle 100 may be transferred to another computer for processing of the data.

For example, as shown in FIG. 3, data from the computer system 312 may be transmitted via a network to a server 320 on the cloud side (which may also be referred to as the cloud) for further processing. The network and intermediate nodes may include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local area networks, private networks using proprietary communication protocols of one or more companies, Ethernet, WiFi, the hypertext transfer protocol (HTTP), and various combinations of the foregoing. Such communication may be performed by any device capable of transferring data to and from other computers, such as modems and wireless interfaces. For example, data such as vehicle status and environmental information is transmitted to the cloud-side server 320 for further processing. The cloud-side server can use a variety of neural network models to identify and process the data, and feed the identification results back to the computer system 312, so that the computer system 312 can determine whether to turn the assisted automatic driving function on or off.
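The exchange between the computer system 312 and the cloud-side server 320 can be illustrated with a minimal sketch, assuming a JSON message body. The field names (`vehicle_id`, `status`, `environment`, `enable_assist`) and both function names are hypothetical; the application does not define a message format, and the transport itself (for example HTTP) is deliberately left out.

```python
import json
from typing import Iterable


def build_upload_payload(vehicle_id: str,
                         speed_mps: float,
                         detected_objects: Iterable[str]) -> str:
    """Serialize vehicle status and environment information (assumed schema)."""
    message = {
        "vehicle_id": vehicle_id,
        "status": {"speed_mps": speed_mps},
        "environment": {"objects": list(detected_objects)},
    }
    return json.dumps(message, sort_keys=True)


def assisted_function_enabled(response_body: str, default: bool = False) -> bool:
    """Read the cloud-side identification result; fall back to `default`
    when the response is malformed or lacks a boolean `enable_assist` field."""
    try:
        result = json.loads(response_body)
    except json.JSONDecodeError:
        return default
    if not isinstance(result, dict):
        return default
    value = result.get("enable_assist")
    return value if isinstance(value, bool) else default
```

Falling back to a caller-supplied default when the response cannot be parsed is a design choice of this sketch, so that a lost or corrupted reply does not silently change the state of the assisted automatic driving function.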

In one example, server 320 may include a server having multiple computers, such as a load balancing server farm, that exchange information with different nodes of the network for the purpose of receiving, processing, and transmitting data from computer system 312. The server may be configured similarly to computer system 312, with processor 330, memory 340, instructions 350, and data 360.

An automated driving system may contain several assisted automated driving functions, such as a pre-collision system (PCS) for pre-collision safety braking, adaptive cruise control (ACC), lane keeping aid (LKA), cross traffic alert (CTA), rear cross traffic alert (RCTA), blind spot warning (BSW), off vehicle warning, and traffic jam assist (TJA).

At present, an autonomous vehicle drives on the basis of a preset destination and the surrounding environment of the vehicle obtained by various sensors, and finally delivers the user to the destination along a planned route. However, during actual driving, the user may form temporary intentions, based on the visual information around the vehicle, that differ from simply driving to the destination. For example, if the user feels that the vehicle is too close to the car in front, the user may want to increase the distance.

However, under the existing autonomous driving technology, if the user forms such a temporary intention, the user can only take over control of the vehicle temporarily through manual intervention and then carry out the temporary intention. Since the vehicle has been switched to the manual driving mode at this time, the user can no longer enjoy the more worry-free and safer driving experience brought by autonomous driving technology. In addition, when the level of automatic driving reaches Level 5 (L5) (as defined by the Society of Automotive Engineers (SAE) levels of driving automation), the human intervention function of the vehicle may be removed. In that case, the driver will not be able to carry out the temporary intention at all, which degrades the user experience.

In view of the above problems, the present application provides a method for controlling the driving of a vehicle, so that while an autonomous vehicle is driving in the automatic driving mode, if the user has a temporary intention, multimodal understanding can be performed on the user's instruction and the environmental information around the vehicle to determine the user's driving intention, and the motion of the vehicle can be controlled according to that driving intention. Therefore, the user's temporary intention can be executed in the automatic driving mode, and the user's experience during automatic driving can be further improved.

FIG. 4 is an example diagram of a method for controlling the driving of a vehicle provided by an embodiment of the present application. It should be understood that the method shown in FIG. 4 can be applied to the vehicle shown in FIG. 1 or the automatic driving system shown in FIG. 2 . It should be understood that the method shown in FIG. 4 is performed in an automatic driving mode.

As shown in FIG. 4, the method 400 includes steps S410 to S440, which will be described in detail below.

S410, in the automatic driving mode of the vehicle, obtain a user instruction.

Optionally, the user instruction includes any one or more of a user's natural voice instruction (i.e., a user's voice instruction), a user text instruction, and a user air gesture instruction, which is not limited in this application.

It should be understood that, while the vehicle is driving in the automatic driving mode, the user may have a temporary intention, for example: on seeing an acquaintance at the side of the road, the user wants to stop temporarily to greet them; or, on feeling that the vehicle is too close to the car in front, the user wants to increase the distance. Such temporary intentions can be input to the relevant in-vehicle devices by means of user instructions. For example, the temporary intention is input to a microphone by means of a natural voice instruction; for another example, it is input to a relevant user action acquisition device by means of an air gesture instruction; for yet another example, it is input directly to a relevant text input device by means of a text instruction. This is not limited in this application.

Optionally, if the obtaining of the user instruction in step S410 is limited to obtaining a user text instruction, then in actual operation the user text instruction may be obtained directly from the user through a relevant text entry device, or a voice instruction or an air gesture instruction of the user may first be obtained from another device and then converted into a text instruction by a related device. The present application does not limit how the text instruction is acquired. Exemplarily, if the user forms a temporary intention, the user can speak the intention in natural speech to a relevant in-vehicle device (e.g., a microphone). Optionally, the conversion of the natural voice instruction into a text instruction may be implemented by automatic speech recognition (ASR); in this case, acquiring the user text instruction may specifically mean acquiring the text instruction from the ASR. Exemplarily, the air gesture instruction can be converted into a text instruction by a relevant gesture recognition device.
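The conversion of the different instruction modalities into a text instruction can be sketched as follows. This is an illustrative sketch only: `UserInstruction`, `to_text_instruction`, the modality labels, and the converter table are hypothetical names, and the ASR and gesture recognition components are represented by caller-supplied functions rather than by any real library.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class UserInstruction:
    modality: str    # assumed labels: "text", "voice" or "gesture"
    payload: object  # raw text, audio samples, or gesture frames


def to_text_instruction(instruction: UserInstruction,
                        converters: Dict[str, Callable[[object], str]]) -> str:
    """Return the instruction as text, converting voice or gesture input
    with the matching converter (e.g. an ASR or gesture recognition device)."""
    if instruction.modality == "text":
        # Text entered on a text input device is used directly.
        return str(instruction.payload).strip()
    try:
        convert = converters[instruction.modality]
    except KeyError:
        raise ValueError(f"unsupported modality: {instruction.modality!r}") from None
    return convert(instruction.payload).strip()
```

A caller would register one converter per supported modality, for example `{"voice": asr_device.transcribe}`, where `asr_device` stands for whatever speech recognition component the vehicle provides (also an assumption of this sketch).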

It should be understood that, for the convenience of description, in the following embodiments, the user text instruction will be used as an example for description, but it should be understood that this does not constitute a limitation on the solution of the present application.

S420, obtain environmental information around the vehicle.

It should be understood that the environmental information around the vehicle can be acquired through a photographing device; specifically, an image or video is acquired through the photographing device, and the environmental information is reflected by the information in the image or video. The environmental information can also be acquired through lidar, vehicle-mounted sensors, and/or the Internet of Vehicles, or the like, which is not limited in this application. For convenience of description, in this application, the solution will be described by taking the photographing device acquiring the environmental information as an example.

It should be understood that in actual operation, the photographing device may obtain video information or image information, or may first obtain video information around the vehicle and then obtain image information from the video, which is not limited in this application. For ease of description, in the following embodiments, the acquisition of image information captured by a photographing device is taken as an example for description, but it should be understood that this does not constitute a limitation to the present application.

Optionally, after the user instruction is acquired, a shooting activation signal may be sent to the photographing apparatus to activate the photographing apparatus to photograph image information (ie, environmental information) around the vehicle. After the photographing device photographs the surrounding image information, the surrounding image information photographed by the photographing device is acquired.

Optionally, the photographing device may periodically photograph image information around the vehicle. At this time, acquiring image information around the vehicle may include: acquiring image information around the vehicle periodically captured by a photographing device.

In this case, when performing the multimodal understanding described below, it is necessary to select suitable image information from the periodically captured image information around the vehicle.

The suitable image information may be the image information most recently captured by the photographing device, or may be the image information corresponding to a specific time interval estimated from the recognition time of the natural voice instruction or the air gesture instruction. It may also be the image information corresponding to the time at which the text instruction is acquired. Specifically, the image information should be selected according to the actual situation, which is not limited in this application.
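The selection of suitable image information from periodically captured frames can be sketched as follows. The sketch assumes that each frame carries a capture timestamp and that the frames are sorted by time; the type alias `Frame` and the function `select_frame` are hypothetical, and choosing the single frame nearest to the instruction time is a simplification of the "specific time interval" described above.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

# (capture timestamp in seconds, image identifier); an assumed representation.
Frame = Tuple[float, str]


def select_frame(frames: List[Frame],
                 instruction_time_s: Optional[float] = None) -> Frame:
    """Pick the newest frame, or the frame closest to the instruction time.

    `frames` must be sorted by ascending timestamp.
    """
    if not frames:
        raise ValueError("no frames available")
    if instruction_time_s is None:
        return frames[-1]  # most recently captured image
    timestamps = [t for t, _ in frames]
    i = bisect_left(timestamps, instruction_time_s)
    if i == 0:
        return frames[0]
    if i == len(frames):
        return frames[-1]
    before, after = frames[i - 1], frames[i]
    # Prefer the earlier frame on a tie, since it shows what the user saw.
    if instruction_time_s - before[0] <= after[0] - instruction_time_s:
        return before
    return after
```

When the instruction time is estimated from the recognition time of a voice or gesture instruction, the estimate would be passed as `instruction_time_s`; omitting it selects the newest image.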

S430, perform multimodal understanding on the user instruction and the environmental information around the vehicle, and determine the user's driving intention.

Alternatively, step S430 may be: determining the user's driving intention according to the user instruction and the environmental information around the vehicle. That is, the solution of the present application does not limit the way in which the user's driving intention is determined from the user instruction and the environmental information around the vehicle; it may be determined through multimodal understanding or by other means, which is not limited in this application. As a preferred solution, the following description takes performing multimodal understanding on the user instruction and the environmental information around the vehicle to determine the user's driving intention as an example.

Then, in the present application, after the user instruction and the environmental information around the vehicle are obtained, multimodal understanding can be performed to determine the user's driving intention. Step S430 can be completed in a multimodal processing module (i.e., the multimodal processing module 540 in FIG. 5). The module is described below with reference to FIG. 5, and the multimodal processing process is described with reference to FIG. 8 and FIG. 9; details are not repeated here.

Optionally, the driving intention includes at least one intention, each intention in the at least one intention includes n slots, and each slot in the n slots includes a slot name, a slot value, and a classification of the slot value, n is greater than or equal to 0, and n is an integer.

Optionally, the intent may include at least one of: stop, overtake, slow down, follow, turn, and the like. It should be understood that other intentions may also be included in actual operations, which are not limited in this application.

Optionally, the slot name may include at least one of: a parking position, a speed value, an overtaking or following object, a turning direction, and the like. It should be understood that in actual operation, other slot names may also be included, which are not limited in this application.

Optionally, the classification of the slot value may be: an enumeration type slot value, a text type slot value or an environment type slot value.

The enumeration-type slot value indicates that the slot value is a predefined enumeration value. For example, the user instruction is "turn right at the next intersection". There is then a slot corresponding to the turning direction. The turning direction can be enumerated; for example, there are only four options: left, right, straight, and U-turn. The slot value of the slot "turning direction" is then "right", which can be understood as an enumeration-type slot value.

The text-type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction. It should be understood that such a slot value is non-enumerable. For example, the user instruction is "stop next to the gas station". There is then a slot corresponding to the parking position. Since the parking position cannot be enumerated, the substring "next to the gas station" in the instruction can be used as the slot value, which can be understood as a text-type slot value. For another example, the user instruction is "park at the luxurious hotel in front". There is then a slot corresponding to the parking position. Since the parking position cannot be enumerated, text generated according to the instruction, such as "high-end hotel", can be used as the slot value, which can also be understood as a text-type slot value. It should be understood that the descriptions above and below take a user text instruction as an example. In that case, the text-type slot value indicates that the slot value is a substring of the user text instruction or text generated according to the user text instruction, and the following embodiments take this as an example.

The environment-type slot value indicates that the slot value is identified in the environment information according to the content mentioned in the user instruction. Optionally, when the environmental information is acquired by the photographing device, the environmental information may be image information; the environment-type slot value may then also be called an image-type slot value, since the image reflects the environment around the vehicle. Accordingly, the image-type slot value indicates that the slot value is identified in the image information according to the content mentioned in the user instruction. For example, in the scenario shown in FIG. 11 below, when the user instruction is "drive to the blue car position and pull over to the side", there is a slot corresponding to the parking position. Since the parking position is the "blue car position", a rectangular frame can be used to mark the "blue car" in the image information (as shown in FIG. 11). The rectangular frame is then the slot value, which can be understood as an image-type slot value. It should be understood that the following description takes the image-type slot value as an example, which is not limited in this application.

It should be understood that the above-mentioned "driving intention includes at least one intention" means that the driving intention may include one intention or multiple intentions at the same time. For example, when the user instruction is "turn right at the next intersection", it includes a steering intent; when the user instruction is "turn right at the next intersection and stop", it includes a steering intent and a parking intent.

It is also mentioned above that "each intent in the at least one intent includes n slots, each of the n slots includes a slot name, a slot value, and a classification of the slot value, n is greater than or equal to 0, and n is an integer". This means that an intent may include one or more slots describing the intent, or may include no slot. If the intent includes slots describing it, each slot includes a slot name, a slot value, and a classification of the slot value.

For example, when the user instruction is "stop", the intent is to stop, there is no slot describing the intent, and subsequent operations can be performed directly based on the intent.

For another example, if the user instruction is "stop at the gas station ahead", there may be multiple slots describing the intent (parking), and the slot name, slot value, and classification of the slot value corresponding to each slot can be listed according to the user instruction. Exemplarily, the slot name, slot value, and classification of the slot value of the first slot of the parking intent may be the parking position, "the gas station ahead", and the text-type slot value, respectively; the slot name, slot value, and classification of the slot value of the second slot may be the parking position, the rectangular frame (identifying the gas station ahead in the image information), and the image-type slot value, respectively.

Based on this, it can be seen that the same driving intention may involve one slot value classification or multiple slot value classifications. This needs to be analyzed case by case, and this application does not enumerate the cases exhaustively.
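The structure of the driving intention described above can be sketched as a small data model. This is an illustrative sketch only: the class names, field names, and the `(x, y, width, height)` layout of the rectangular frame are assumptions and are not defined by this application.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple, Union

# Rectangular frame in the image; the (x, y, width, height) layout is hypothetical.
BoundingBox = Tuple[int, int, int, int]


class SlotValueClass(Enum):
    ENUMERATION = "enumeration"  # predefined enumeration value, e.g. "right"
    TEXT = "text"                # substring of, or text generated from, the instruction
    ENVIRONMENT = "environment"  # identified in the environment (image) information


@dataclass
class Slot:
    name: str
    value: Union[str, BoundingBox]
    value_class: SlotValueClass


@dataclass
class Intent:
    name: str
    slots: List[Slot] = field(default_factory=list)  # n slots, n >= 0


@dataclass
class DrivingIntention:
    intents: List[Intent]

    def __post_init__(self) -> None:
        if not self.intents:
            raise ValueError("a driving intention includes at least one intent")


# "stop": one intent with no slots.
stop = DrivingIntention([Intent("stop")])

# "stop at the gas station ahead": one intent, two slots with different classifications.
park_at_gas_station = DrivingIntention([
    Intent("stop", [
        Slot("parking position", "the gas station ahead", SlotValueClass.TEXT),
        Slot("parking position", (420, 180, 96, 64), SlotValueClass.ENVIRONMENT),
    ]),
])
```

The bounding-box numbers are placeholders chosen for the example; they do not come from any figure in this application.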

Optionally, the user's driving intention can be presented to the user through an augmented reality-head up display (AR-HUD) or a central control screen, so that the user can timely judge the correctness of the multimodal understanding result.

For example, when the driving intention contains an environment-type slot value, the AR-HUD can present the object mentioned by the user on the windshield (such as the rectangular box shown in (a) of FIG. 11), or the central control screen or the like can display the object mentioned by the user.

S440, according to the user's driving intention, generate an automatic driving control instruction for the vehicle.

Optionally, an automatic driving control instruction for the vehicle may be generated according to the driving intention obtained above, so that the vehicle can be controlled according to the automatic driving control instruction in the automatic driving mode.

In the process of automatic driving, the rules of automatic driving should be obeyed, that is, driving should be carried out in combination with the surrounding environment and should not violate traffic laws.

Therefore, optionally, it is possible to first determine whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; if the driving intention is feasible, then generate an automatic driving control instruction for the vehicle. Specifically, reference may be made to the descriptions of steps 10 and 11 in FIG. 6 below.

Optionally, if the driving intention is not feasible, prompt information may be generated and sent to the user. Optionally, the prompt information may also include the reason why the driving intention is not feasible.
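The feasibility check and the prompt can be sketched as follows. This is a simplified illustration: `RoadContext` and its three boolean fields are hypothetical stand-ins for "the surrounding environment and traffic regulations", and the rules and message texts are assumptions rather than anything specified by this application.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class RoadContext:
    # Hypothetical simplification of the surrounding environment and traffic regulations.
    stopping_allowed: bool
    turn_allowed: bool
    path_clear: bool


def check_feasibility(intent: str, ctx: RoadContext) -> Tuple[bool, Optional[str]]:
    """Return (feasible, reason); reason is None when the intent is feasible."""
    if intent == "stop" and not ctx.stopping_allowed:
        return False, "stopping is not permitted here"
    if intent == "turn" and not ctx.turn_allowed:
        return False, "turning is not permitted here"
    if not ctx.path_clear:
        return False, "the surrounding environment does not allow the maneuver"
    return True, None


def handle_intention(intent: str, ctx: RoadContext) -> str:
    """Generate either a prompt for the user or a placeholder control instruction."""
    feasible, reason = check_feasibility(intent, ctx)
    if not feasible:
        # Prompt information that also carries the reason.
        return f"Cannot execute '{intent}': {reason}"
    return f"control instruction generated for '{intent}'"


ctx = RoadContext(stopping_allowed=False, turn_allowed=True, path_clear=True)
message = handle_intention("stop", ctx)
```

Returning the reason together with the verdict keeps the optional "reason why the driving intention is not feasible" available to whichever component presents the prompt.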

Optionally, if the driving intention is feasible, the vehicle can also prompt the user through a voice broadcast, such as "parking for you"; the AR-HUD or the central control screen can also be used to display to the user the target path and the target position to which the vehicle is to drive (e.g., the dynamic arrows and boxes shown in (b) of FIG. 11).

Optionally, the above-mentioned method 400 may be executed on a cloud server or an edge cloud server, or may be executed in a computer system of a vehicle, which is not limited in this application.

In the embodiment of the present application, in the automatic driving mode of the vehicle, the user's driving intention can be determined by acquiring user instructions and environmental information around the vehicle, and performing multi-modal understanding of the user instructions and environmental information around the vehicle; Then, according to the user's driving intention, an automatic driving control command for the vehicle is generated. When the vehicle is driving in the automatic driving mode, the user's temporary driving intention can be executed, and the user does not need to manually take over the control to execute the temporary driving intention, so that the user's experience in the process of automatic driving can be improved.

FIG. 5 is an example diagram of a system architecture provided by an embodiment of the present application. It should be understood that the system architecture is only an example, and does not constitute a limitation to the present application. As shown in FIG. 5, the system architecture 500 includes: a microphone 510, an automatic speech recognition (ASR) module 520, a camera 530 (i.e., a photographing device), a multimodal processing module 540, a decision planning calculation module 550 and a vehicle motion control module 560. These modules are described below.

Microphone 510: a microphone or microphone group deployed in the vehicle cockpit, used to collect audio information of the user in the cockpit, that is, the user's voice command involved in this application, which may also be referred to as the user's natural voice command.

ASR module 520: used to recognize the user's natural language instructions collected by the microphone 510, and convert the user's natural language instructions into text instructions.

Camera 530: a camera or camera group deployed on the vehicle, used to collect image information around the vehicle.

Multimodal processing module 540: mainly includes a multimodal intent recognition engine. It is used to receive the text instruction recognized by the ASR module 520 and the image information collected by the camera 530, and generate the corresponding driving intention according to the text instruction and the image information. And in some cases, the multimodal processing module 540 can also be used to control the camera 530 to collect image information, as shown in Embodiment 1 below.

Decision planning calculation module 550: used for judging the driving intention generated by the multimodal processing module 540 in combination with traffic regulations, surrounding environment and other conditions to determine whether the driving intention is feasible. The driving intent is adjusted where necessary, and vehicle control commands are generated.

Vehicle motion control module 560 : used to control the vehicle motion according to the vehicle control command from the decision planning calculation module 550 .

It should be understood that the above components or modules can be physically deployed individually or in any combination. It should be understood that in the case of combined deployment, information forwarding between the combined modules may not be necessary.

It should be understood that all the components or modules in the above system architecture can be deployed in the vehicle. Alternatively, some components or modules, such as the ASR module 520, the multimodal processing module 540 and the decision planning calculation module 550, can be deployed in part or in whole on a cloud server or an edge cloud server, with the others deployed on the vehicle, so that the solution of the present application is implemented by means of vehicle-cloud interaction. This is not limited in this application.

Based on the above-mentioned system architecture 500 , the specific implementation of the present application will be described in detail below with reference to FIGS. 6 to 9 .

FIG. 6 is an example diagram of a specific implementation provided by an embodiment of the present application. As shown in FIG. 6 , the specific implementation includes steps 1 to 11, and these steps are described in detail below.

Step 1. The user issues a voice command.

While the vehicle is driving automatically toward a pre-input destination, if the user in the vehicle temporarily generates a new driving intention, the user can state the intention aloud to the microphone 510 in the vehicle.

Step 2. Send natural voice commands.

The microphone 510 sends the received natural voice instruction to the ASR module 520 .

Step 3. Voice recognition.

The ASR module 520 performs voice recognition on the received voice command, and identifies the text command corresponding to the voice command.

Step 4. Transmit user text instructions.

The ASR module 520 transmits the recognized text instruction to the multimodal processing module 540.

Step 5. Send a capture activation signal.

After receiving the text instruction, the multimodal processing module 540 sends a shooting activation signal to the camera 530 to activate the camera 530 to collect surrounding image information.

Step 6. Capture image information around the vehicle.

After the camera 530 receives the shooting activation signal, it shoots image information around the vehicle.

Step 7. Send image information around the vehicle.

The camera 530 sends the captured image information around the vehicle to the multimodal processing module 540 .

Step 8. Multimodal understanding based on textual instructions and image information.

The multimodal processing module 540 performs multimodal understanding based on the text instruction and image information, and obtains the user's driving intention.

It should be understood that the driving intention has been introduced in detail above, and will not be repeated here. In addition, the process of multi-modal understanding performed by the multi-modal processing module 540 will be described below in conjunction with FIG. 8 and FIG. 9 .

Step 9. Send driving intent.

The multimodal processing module 540 sends the driving intention identified in step 8 to the decision planning calculation module 550 .

Step 10. Determine if the intent is feasible.

The user's driving intention may not comply with traffic laws (for example, the user asks to drive the wrong way down a one-way street, or asks to stop at an intersection where stopping is not permitted); or the user's driving intention may not be realizable in the current surrounding environment; or some other circumstance may prevent the user's driving intention from being realized.

Therefore, the decision planning calculation module 550 needs to judge whether the driving intention is feasible according to the driving intention in combination with necessary information such as the surrounding environment and traffic regulations, generate prompt information according to the judgment result, and notify the user. For example, if the judgment result is infeasible, the user's driving intention cannot be executed, and the user can be informed of the reason for the inability to execute. If the judgment result is feasible, step 11 is executed.

Step 11. Adjust the driving parameters of the vehicle according to the driving intention, surrounding environment, traffic regulations and other information.

Specifically, if the judgment result in step 10 is feasible, the decision planning calculation module 550 determines the specific vehicle motion control instruction according to the driving intention, surrounding environment, traffic regulations and other necessary information, and sends it to the vehicle motion control module 560 . The vehicle motion control module 560 performs specific execution operations according to the vehicle motion control instructions.

It should be understood that, after the driving intention is completed, the control instruction of the vehicle motion may be modified according to the actual situation, so that the vehicle continues to drive in the automatic driving mode to the final destination to be reached by the user.
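The order of steps 1 to 11 in FIG. 6 can be sketched as a single function that wires the modules together. This is an illustration of the control flow only: every callable parameter is a hypothetical stand-in for the corresponding module (ASR module 520, camera 530, multimodal processing module 540, decision planning calculation module 550, vehicle motion control module 560), and the stub implementations in the usage part are placeholders.

```python
from typing import Any, Callable, List, Optional, Tuple


def run_flow(
    audio: Any,
    recognize_speech: Callable[[Any], str],
    capture_image: Callable[[], Any],
    understand: Callable[[str, Any], Any],
    check_feasible: Callable[[Any], Tuple[bool, Optional[str]]],
    plan_control: Callable[[Any], Any],
    execute: Callable[[Any], None],
    notify_user: Callable[[str], None],
) -> bool:
    """Run one pass of the FIG. 6 flow; return True if the intention was executed."""
    text = recognize_speech(audio)                # steps 2-4: microphone -> ASR -> text instruction
    image = capture_image()                       # steps 5-7: camera activated after the text arrives
    intention = understand(text, image)           # step 8: multimodal understanding
    feasible, reason = check_feasible(intention)  # steps 9-10: feasibility judgment
    if not feasible:
        notify_user(f"Cannot execute: {reason}")
        return False
    execute(plan_control(intention))              # step 11: control instruction is carried out
    return True


# Usage with stub modules that only record what happened.
log: List[str] = []
ok = run_flow(
    audio=b"raw-audio",
    recognize_speech=lambda a: "stop at the gas station ahead",
    capture_image=lambda: "frame-001",
    understand=lambda t, i: {"intent": "stop", "text": t, "image": i},
    check_feasible=lambda intention: (True, None),
    plan_control=lambda intention: f"control:{intention['intent']}",
    execute=log.append,
    notify_user=log.append,
)
```

Passing the modules in as callables mirrors the statement above that the modules may be deployed individually or in combination, on the vehicle or in the cloud; the flow itself does not depend on where each one runs.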

FIG. 7 is an example diagram of another specific implementation manner provided by an embodiment of the present application. As shown in FIG. 7 , the specific implementation includes steps 1 to 10, and these steps are described in detail below.

Step 1 to Step 4. Reference may be made to Step 1 to Step 4 in the previous implementation manner (in FIG. 6 ), which will not be repeated here.

Step 5. Periodically capture image information around the vehicle.

The camera 530 periodically captures image information around the vehicle.

Step 6. Send image information around the vehicle.

The camera 530 periodically sends the captured image information around the vehicle to the multimodal processing module 540 .

Step 7. Multimodal understanding based on textual instructions and image information.

The multi-modal processing module 540 obtains the user's driving intention based on multi-modal understanding of the text instruction and image information at an appropriate time.

The image information at the appropriate time may be the latest image information, or may be image information corresponding to a specific time interval estimated according to the recognition time of the natural language instruction.

Likewise, the driving intention has been introduced in detail above, and will not be repeated here. In addition, the process of multi-modal understanding performed by the multi-modal processing module 540 will be described below in conjunction with FIG. 8 and FIG. 9 .

Step 8 to Step 10. Reference may be made to Step 9 to Step 11 in the previous implementation (in FIG. 6 ), which will not be repeated here.

FIG. 8 is an example diagram of a multimodal processing process provided by an embodiment of the present application.

As shown in Figure 8, the multi-modal processing mainly inputs user instructions and environmental information into the multi-modal processing module, and the multi-modal understanding is carried out through the multi-modal processing module, and finally the driving intention is output.

It should be understood that the multimodal processing module is obtained through pre-training. Specifically, in the training process, user instructions (such as user voice instructions, user text instructions or user air gesture instructions), environmental information (such as image information), and the corresponding driving intentions can be used as training data to train the multimodal processing module, as shown in FIG. 10. In this way, in the application stage of the multimodal processing module, the corresponding driving intention can be output after user instructions and environmental information are input.

FIG. 9 is an exemplary diagram of another multimodal processing process provided by an embodiment of the present application. In FIG. 9, text instructions are used as user instructions, and image information is used as environmental information. It should be understood that FIG. 9 is only a structural example of the multimodal processing module shown in FIG. 8, and does not constitute a limitation to the present application. It should be understood that, in practice, the multimodal processing module can also take other forms and can be composed of other processing models, networks or modules, as long as the driving intention can be output from the input text instruction and image information. The multimodal processing process in this example is described below with reference to FIG. 9.

As shown in Figure 9, the multimodal processing module may include a text processing model, a convolutional neural network (CNN), an attention module att.1 and an attention module att.2. The text processing model may be a BERT model commonly used in text processing, or may be other models that can be used for text processing, which is not limited in this application. The CNN network can be a deep residual network (Deep residual network, ResNet), etc., which is not limited.

In this example, the process of the multimodal processing module for understanding the driving intent can be as follows:

After the multimodal processing module obtains the text instruction and image information, the text instruction extracts the corresponding text features through the BERT model; the image information extracts the corresponding image features through the CNN network (eg: ResNet).

The attention module att.1 is used to fuse the text features with the image features, so as to obtain at least one intent and the n slots corresponding to each intent in the at least one intent, where n is greater than or equal to 0 and n is an integer. Each of the n slots includes a slot name, a slot value and a classification of the slot value, where the classification of the slot value is an enumeration-type slot value, a text-type slot value or an image-type slot value (see the description of the driving intention given above for the method of FIG. 4).

If the slot value of a slot corresponding to an intent obtained by the attention module att.1 is classified as an image-type slot value, the image features are integrated with the text features through the attention module att.2 to obtain the slot value of that slot, that is, the rectangular frame of the object mentioned in the user text instruction, for example, the rectangular frame corresponding to the blue car in FIG. 11.

To sum up, the information obtained by att.1 and att.2 is the driving intention.
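How attention between text features and image features can yield a rectangular frame may be illustrated with a toy example. This is not the network of FIG. 9: it assumes, hypothetically, that the image has already been reduced to a few candidate regions, each with a feature vector and a bounding box, and it uses plain scaled dot-product attention with hand-written two-dimensional features instead of BERT and CNN outputs.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # hypothetical (x, y, width, height)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Sequence[float], keys: Sequence[Sequence[float]]) -> List[float]:
    """Scaled dot-product attention weights of one query over several keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def ground_object(text_feature: Sequence[float],
                  region_features: Sequence[Sequence[float]],
                  region_boxes: Sequence[Box]) -> Tuple[Box, List[float]]:
    """Return the box of the region that the text feature attends to most."""
    weights = attend(text_feature, region_features)
    best = max(range(len(weights)), key=weights.__getitem__)
    return region_boxes[best], weights


# Toy features: the second region is the one most similar to the text feature.
text_feature = [1.0, 0.0]
region_features = [[0.1, 0.9], [0.9, 0.1], [0.0, 0.0]]
region_boxes = [(10, 20, 50, 40), (200, 120, 80, 60), (400, 300, 30, 30)]
box, weights = ground_object(text_feature, region_features, region_boxes)
```

In a trained model the features and the mapping from attention output to a box would be learned; the sketch only shows the selection mechanism.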

FIG. 10 is an example diagram of a training method for a multimodal processing module provided by an embodiment of the present application. As shown in FIG. 10, the training method 1000 includes steps S1010 and S1020, and the steps are described below.

S1010, acquiring training data.

The training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data.

The driving intent includes at least one intent, each intent in the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is greater than or equal to 0, where n is an integer.

The intent includes at least one of: stop, overtake, slow down, follow, turn, and the like.

The slot name includes at least one of: a parking position, a speed value, an overtaking or following object, a turning direction, and the like.

Slot values are classified as: enumeration type slot value, text type slot value or environment type slot value.

S1020, train a multimodal processing module according to the training input data and the training target data.

FIG. 11 is an example diagram of an application scenario provided by an embodiment of the present application. It should be understood that the application scenario shown in FIG. 11 is only an example, and does not constitute a limitation to the present application. The application scenario is described below with reference to FIG. 11 .

As shown in (a) of FIG. 11, while the vehicle is driving in the autonomous driving mode toward a preset destination, the user of the autonomous driving vehicle temporarily generates a new driving intention and issues a natural voice command to the vehicle (for example, to a microphone on the vehicle), such as "drive to the blue car position and pull over". Subsequently, the relevant on-board device, such as the ASR module, recognizes the natural voice command and converts it into a text command. Next, the device or related module on the vehicle for controlling the driving of the vehicle determines the temporary driving intention of the user (that is, the user needs to park at the roadside where the blue car ahead is) through the above method 400, and then generates appropriate vehicle control commands according to the temporary driving intention and issues them to the vehicle. In addition, the vehicle can provide feedback to the user through voice announcements and/or an augmented reality-head up display (AR-HUD). As shown in (b) of FIG. 11, the vehicle can prompt the user through a voice broadcast, such as "stopping for you"; it can also display to the user, through the AR-HUD, the target path and target position to which the vehicle is to drive.

It should be understood that this application scenario can also be understood as a user display interface, which can present to the user the driving intention, such as the rectangular frame shown in (a) of FIG. 11, as well as the target path and the target position for travel, such as the arrows and boxes shown in (b) of FIG. 11.

FIG. 12 is an example diagram of a device for controlling the driving of a vehicle provided by an embodiment of the present application. As shown in FIG. 12 , the apparatus 1200 includes an acquisition unit 1210 and a processing unit 1220 .

In the automatic driving mode of the vehicle, the acquisition unit 1210 is configured to obtain user instructions.

The acquisition unit 1210 is further configured to acquire environmental information around the vehicle.

The processing unit 1220 is configured to perform multimodal understanding on user instructions and environmental information around the vehicle to determine the user's driving intention.

The processing unit 1220 is further configured to generate an automatic driving control instruction for the vehicle according to the user's driving intention.

Optionally, the driving intention may include at least one intent, each intent in the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is greater than or equal to 0 and n is an integer.

Optionally, the intent may include at least one of: stop, overtake, slow down, follow, turn, and the like.

Optionally, the slot name may include at least one of: a parking position, a speed value, an overtaking or following object, a turning direction, and the like.

Optionally, the classification of the slot value may be: an enumeration-type slot value, a text-type slot value or an environment-type slot value, where the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment-type slot value indicates that the slot value is identified in the environment information according to the content mentioned in the user instruction.

Optionally, the processing unit 1220 may also be used to: determine whether the driving intention is feasible according to the driving intention, the surrounding environment and traffic regulations; if the driving intention is feasible, generate an automatic driving control instruction for the vehicle.

Optionally, the user instruction includes any one or more of a user voice instruction, a user text instruction, and a user air gesture instruction.

Optionally, the apparatus 1200 may further include: a sending unit 1230, the sending unit 1230 may be configured to send a photographing activation signal to the photographing apparatus, so as to activate the photographing apparatus to photograph the environmental information around the vehicle;

The acquisition unit 1210 may also be configured to: acquire the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.

Optionally, the acquisition unit 1210 may be further configured to: acquire environmental information around the vehicle periodically photographed by the photographing device.

Optionally, the user's driving intention can be presented to the user through an augmented reality-head-up display AR-HUD or a central control screen.

FIG. 13 is a training device for a multimodal processing module provided by an embodiment of the present application. As shown in FIG. 13 , the apparatus 1300 includes an acquisition unit 1310 and a processing unit 1320 .

The acquisition unit 1310 is configured to obtain training data, the training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data.

The processing unit 1320 is configured to train the multimodal processing module according to the training input data and the training target data.

Optionally, the driving intention may include at least one intent, each intent in the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is greater than or equal to 0 and n is an integer.

Optionally, the slot value can be classified as: an enumeration-type slot value, a text-type slot value or an environment-type slot value, where the enumeration-type slot value indicates that the slot value is a predefined enumeration value, the text-type slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment-type slot value indicates that the slot value is identified in the environment information according to the content mentioned in the user instruction.

FIG. 14 is a schematic structural diagram of an apparatus provided by an embodiment of the present application. The apparatus 1400 includes a processor 1402 , a communication interface 1403 and a memory 1404 .

Alternatively, one example of the apparatus 1400 may be a chip. Another example of apparatus 1400 may be a computing device.

The processor 1402, the memory 1404, and the communication interface 1403 can communicate through a bus. Executable code is stored in the memory 1404, and the processor 1402 reads the executable code in the memory 1404 to execute the corresponding method. The memory 1404 may also include other software modules required for running processes, such as an operating system. The operating system can be LINUX™, UNIX™, WINDOWS™, or the like.

For example, the executable code in the memory 1404 is used to implement the method shown in FIG. 4 or FIG. 10, and the processor 1402 reads the executable code in the memory 1404 to execute the method shown in FIG. 4 or FIG. 10.

The processor 1402 may be a CPU. The memory 1404 may include volatile memory, such as random access memory (RAM). The memory 1404 may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).

In some embodiments of the present application, the disclosed methods may be implemented as computer program instructions encoded on a computer-readable storage medium in a machine-readable format, or on other non-transitory media or articles of manufacture. FIG. 15 schematically illustrates a conceptual partial view of an example computer program product, arranged in accordance with at least some embodiments presented herein, that includes a computer program for executing a computer process on a computing device. In one embodiment, the example computer program product 1500 is provided using a signal bearing medium 1501. The signal bearing medium 1501 may include one or more program instructions 1502 that, when executed by one or more processors, may provide all or part of the functions described above with respect to the methods shown in FIG. 4 or FIG. 10. Thus, for example, with reference to the embodiment shown in FIG. 4, one or more of the features of S410 to S440 may be undertaken by one or more instructions associated with the signal bearing medium 1501.

In some examples, the signal bearing medium 1501 may include a computer-readable medium 1503, such as, but not limited to, a hard drive, a compact disc (CD), a digital video disc (DVD), a digital tape, a memory, a read-only memory (ROM), or a random access memory (RAM). In some implementations, the signal bearing medium 1501 may include a computer-recordable medium 1504, such as, but not limited to, a memory, read/write (R/W) CDs, and R/W DVDs. In some embodiments, the signal bearing medium 1501 may include a communication medium 1505, such as, but not limited to, digital and/or analog communication media (e.g., fiber optic cables, waveguides, wired communication links, and wireless communication links). Thus, for example, the signal bearing medium 1501 may be conveyed by a wireless form of the communication medium 1505 (e.g., a wireless communication medium conforming to the IEEE 802.11 standard or another transmission protocol). The one or more program instructions 1502 may be, for example, computer-executable instructions or logic-implemented instructions. In some examples, the aforementioned computing device may be configured to provide various operations, functions, or actions in response to program instructions 1502 communicated to the computing device via one or more of the computer-readable medium 1503, the computer-recordable medium 1504, and/or the communication medium 1505. It should be understood that the arrangements described herein are for illustrative purposes only. Thus, those skilled in the art will understand that other arrangements and other elements (e.g., machines, interfaces, functions, sequences, and groups of functions) can be used instead, and that some elements may be omitted altogether depending on the desired results. Additionally, many of the described elements are functional entities that may be implemented as discrete or distributed components, or in conjunction with other components, in any suitable combination and position.

The terms "component", "module", "system" and the like are used in this specification to refer to a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between two or more computers. In addition, these components can execute from various computer-readable media having various data structures stored thereon. A component may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems via the signal).

Those of ordinary skill in the art can realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functionality using different methods for each particular application, but such implementations should not be considered beyond the scope of this application.

Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described systems, devices and units may refer to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are only illustrative. The division of the units is only a logical function division, and other division methods are possible in actual implementation. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be in electrical, mechanical, or other forms.

The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for controlling the driving of a vehicle, comprising:

in an automatic driving mode of the vehicle, obtaining a user instruction;

obtaining environmental information around the vehicle;

performing multimodal understanding on the user instruction and the environmental information around the vehicle to determine a driving intention of the user; and

generating an automatic driving control instruction for the vehicle according to the driving intention of the user.
2. The method of claim 1, wherein the driving intention includes at least one intent, each intent of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is an integer greater than or equal to 0.
3. The method of claim 2, wherein the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.
4. The method according to claim 2 or 3, wherein the slot name includes at least one of: a parking position, a speed value, an object of overtaking or following, and a turning direction.
5. The method according to any one of claims 2 to 4, wherein the classification of the slot value is: an enumeration slot value, a text slot value, or an environment slot value,

wherein the enumeration slot value indicates that the slot value is a predefined enumeration value, the text slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment slot value indicates that the slot value is an identifier marked in the environment information according to the content mentioned in the user instruction.
6. The method according to any one of claims 1 to 5, wherein the generating an automatic driving control instruction for the vehicle according to the driving intention of the user comprises:

judging whether the driving intention is feasible according to the driving intention, the surrounding environment, and traffic regulations; and

if the driving intention is feasible, generating the automatic driving control instruction for the vehicle.
7. The method according to any one of claims 1 to 6, wherein the user instruction comprises any one or more of: a user voice instruction, a user text instruction, and a user air gesture instruction.
8. The method according to any one of claims 1 to 7, wherein the method further comprises:

sending a photographing activation signal to a photographing device to activate the photographing device to photograph environmental information around the vehicle;

wherein the obtaining environmental information around the vehicle comprises:

obtaining the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.
9. The method according to any one of claims 1 to 7, wherein the obtaining environmental information around the vehicle comprises:

obtaining environmental information around the vehicle that is periodically photographed by a photographing device.
10. The method according to any one of claims 1 to 9, wherein the user's driving intention is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
11. A device for controlling the driving of a vehicle, comprising an obtaining unit and a processing unit, wherein, in an automatic driving mode of the vehicle:

the obtaining unit is configured to obtain a user instruction;

the obtaining unit is further configured to obtain environmental information around the vehicle;

the processing unit is configured to perform multimodal understanding on the user instruction and the environmental information around the vehicle to determine a driving intention of the user; and

the processing unit is further configured to generate an automatic driving control instruction for the vehicle according to the driving intention of the user.
12. The device of claim 11, wherein the driving intention includes at least one intent, each intent of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is an integer greater than or equal to 0.
13. The device of claim 12, wherein the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.
14. The device according to claim 12 or 13, wherein the slot name includes at least one of: a parking position, a speed value, an object of overtaking or following, and a turning direction.
15. The device according to any one of claims 12 to 14, wherein the classification of the slot value is: an enumeration slot value, a text slot value, or an environment slot value,

wherein the enumeration slot value indicates that the slot value is a predefined enumeration value, the text slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment slot value indicates that the slot value is an identifier marked in the environment information according to the content mentioned in the text user instruction.
16. The device according to any one of claims 11 to 15, wherein the processing unit is further configured to:

judge whether the driving intention is feasible according to the driving intention, the surrounding environment, and traffic regulations; and

if the driving intention is feasible, generate the automatic driving control instruction for the vehicle.
17. The device according to any one of claims 11 to 16, wherein the user instruction comprises any one or more of: a user voice instruction, a user text instruction, and a user air gesture instruction.
18. The device according to any one of claims 11 to 17, wherein the device further comprises a sending unit, and the sending unit is configured to:

send a photographing activation signal to a photographing device to activate the photographing device to photograph environmental information around the vehicle;

wherein the obtaining unit is further configured to:

obtain the environmental information around the vehicle photographed by the photographing device according to the photographing activation signal.
19. The device according to any one of claims 11 to 17, wherein the obtaining unit is further configured to:

obtain environmental information around the vehicle that is periodically photographed by a photographing device.
20. The device according to any one of claims 11 to 19, wherein the user's driving intention is presented to the user through an augmented reality head-up display (AR-HUD) or a central control screen.
21. A training method for a multimodal processing module, comprising:

obtaining training data, wherein the training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data; and

training the multimodal processing module according to the training input data and the training target data.
22. The method of claim 21, wherein the driving intention includes at least one intent, each intent of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is an integer greater than or equal to 0.
23. The method of claim 22, wherein the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.
24. The method according to claim 22 or 23, wherein the slot name includes at least one of: a parking position, a speed value, an object of overtaking or following, and a turning direction.
25. The method according to any one of claims 22 to 24, wherein the classification of the slot value is: an enumeration slot value, a text slot value, or an environment slot value,

wherein the enumeration slot value indicates that the slot value is a predefined enumeration value, the text slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment slot value indicates that the slot value is an identifier marked in the environment information according to the content mentioned in the user instruction.
26. A training device for a multimodal processing module, comprising an obtaining unit and a processing unit, wherein:

the obtaining unit is configured to obtain training data, wherein the training data includes training input data and training target data, the training input data includes user instructions and environmental information around the vehicle, and the training target data includes the driving intention corresponding to the training input data; and

the processing unit is configured to train the multimodal processing module according to the training input data and the training target data.
27. The device of claim 26, wherein the driving intention includes at least one intent, each intent of the at least one intent includes n slots, and each of the n slots includes a slot name, a slot value, and a classification of the slot value, where n is an integer greater than or equal to 0.
28. The device of claim 27, wherein the intent includes at least one of: stopping, overtaking, decelerating, following, and turning.
29. The device according to claim 27 or 28, wherein the slot name includes at least one of: a parking position, a speed value, an object of overtaking or following, and a turning direction.
30. The device according to any one of claims 27 to 29, wherein the classification of the slot value is: an enumeration slot value, a text slot value, or an environment slot value,

wherein the enumeration slot value indicates that the slot value is a predefined enumeration value, the text slot value indicates that the slot value is a substring of the user instruction or text generated according to the user instruction, and the environment slot value indicates that the slot value is an identifier marked in the environment information according to the content mentioned in the text user instruction.
31. A processing method for a multimodal processing module, wherein the multimodal processing module is obtained by training according to the training method of any one of claims 21 to 25, and the processing method comprises:

obtaining, by the multimodal processing module, input data, wherein the input data includes user instructions and environmental information around the vehicle; and

outputting, by the multimodal processing module, a driving intention according to the input data.
32. A multimodal processing module, wherein the multimodal processing module is obtained by training according to the training method of any one of claims 21 to 25, and the multimodal processing module comprises:

an obtaining unit, configured to obtain input data, wherein the input data includes user instructions and environmental information around the vehicle; and

a processing unit, configured to output a driving intention according to the input data.
33. A device for controlling the driving of a vehicle, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to call the program instructions to execute the method for controlling the driving of a vehicle according to any one of claims 1 to 10.
34. A training device for a multimodal processing module, comprising a processor and a memory, wherein the memory is configured to store program instructions, and the processor is configured to call the program instructions to execute the training method for a multimodal processing module according to any one of claims 21 to 25.
35. An automatic driving vehicle, comprising the device for controlling the driving of a vehicle according to any one of claims 11 to 20.
36. A computer-readable storage medium storing program instructions, wherein, when the program instructions are executed by a processor, the method for controlling the driving of a vehicle according to any one of claims 1 to 10 is implemented, and/or the training method for a multimodal processing module according to any one of claims 21 to 25 is implemented.