CN114327041A - Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system - Google Patents

Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system

Info

Publication number
CN114327041A
Authority
CN
China
Prior art keywords
text
cabin
voice
information
virtual life
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111428469.1A
Other languages
Chinese (zh)
Other versions
CN114327041B (en)
Inventor
王丹
付晓寅
耿雷
陈杰
杨松
赵立峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111428469.1A priority Critical patent/CN114327041B/en
Publication of CN114327041A publication Critical patent/CN114327041A/en
Application granted granted Critical
Publication of CN114327041B publication Critical patent/CN114327041B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The application discloses a multi-modal interaction method and system for an intelligent cabin, and an intelligent cabin having the same, and relates to the field of data processing, in particular to the technical field of voice. The specific implementation scheme is as follows: acquiring cabin state information of the intelligent cabin; mapping the cabin state information into a corresponding identification text; acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text; generating a rendering picture of the target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on a first display device, and displaying the dialogue text on a second display device. According to the method, multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image, so that the human-machine interaction process in the intelligent cabin is closer to human-to-human interaction.

Description

Multi-modal interaction method and system for intelligent cabin, and intelligent cabin having the same
Technical Field
The application relates to the field of data processing, in particular to the technical field of voice, and specifically to a multi-modal interaction method and system for an intelligent cabin, and an intelligent cabin having the same.
Background
With the rapid development of the internet and of artificial intelligence hardware and software technologies, the intelligent connected automobile has come into public view as a brand-new automobile system. Existing vehicle-mounted voice interaction modes mostly use a simple vehicle-mounted voice recognition system; the interaction process is rigid and the instruction system is single.
Disclosure of Invention
The application provides a multi-modal interaction method and system for an intelligent cabin, and an intelligent cabin having the same.
According to a first aspect of the application, a multi-modal interaction method for an intelligent cabin is provided, which comprises the following steps:
acquiring cabin state information of the intelligent cabin;
mapping the cabin state information into corresponding identification texts;
acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text;
and generating a rendering picture of the target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on a first display device, and displaying the dialogue text on a second display device.
According to a second aspect of the present application, there is provided a multi-modal interaction system for an intelligent cabin, comprising: a master control chip, a slave chip and a local area network communication module, wherein
the local area network communication module is used for communication among the master control chip, the slave chip and the car machine system of the intelligent cabin;
the master control chip is used for receiving the cabin state information of the intelligent cabin transparently transmitted by the slave chip, mapping the cabin state information into a corresponding identification text, and acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text;
the master control chip is further used for generating a rendering picture of the target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, controlling a first display device to play the rendering picture of the target virtual life image, and sending the dialogue text to the slave chip;
and the slave chip is used for receiving the dialogue text sent by the master control chip and controlling a second display device to display the dialogue text.
According to a third aspect of the present application, there is provided a multi-modal interaction apparatus for an intelligent cabin, comprising:
the first acquisition module is used for acquiring the cabin state information of the intelligent cabin;
the mapping module is used for mapping the cabin state information into corresponding identification texts;
the second acquisition module is used for acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text;
and the control module is used for generating a rendering picture of the target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on a first display device, and displaying the dialogue text on a second display device.
According to a fourth aspect of the application, there is provided a smart cabin comprising: the multimodal interaction system of the second aspect described above.
According to a fifth aspect of the application, there is provided a smart cabin comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multimodal interaction method of the first aspect.
According to a sixth aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the multimodal interaction method of the first aspect.
According to a seventh aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a multimodal interaction method according to the preceding first aspect.
According to the technical scheme of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image, so that the human-machine interaction process is closer to human-to-human interaction and is more natural and vivid.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to a second embodiment of the present application;
fig. 3 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to a third embodiment of the present application;
fig. 4 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to a fourth embodiment of the present application;
fig. 5 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to an embodiment of the present application;
fig. 6 is a schematic diagram of a full-duplex multi-modal interaction system according to a sixth embodiment of the present application;
fig. 7 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to a seventh embodiment of the present application;
fig. 8 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to an eighth embodiment of the present application;
fig. 9 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to a ninth embodiment of the present application;
fig. 10 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to an embodiment of the present application;
fig. 11 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to an eleventh embodiment of the present application;
fig. 12 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to a twelfth embodiment of the present application;
fig. 13 is a schematic diagram of an intelligent cabin provided in an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that, in the technical solution of the present application, the acquisition, storage and application of the personal information of related users all comply with relevant laws and regulations and do not violate public order or good customs.
In the related art, most vehicle-mounted voice interaction modes use a simple vehicle-mounted voice recognition system; the interaction process is rigid and the instruction system is single.
Therefore, the application provides a multi-modal interaction method, system and apparatus for an intelligent cabin, and an intelligent cabin having the same. The intelligent cabin can be an intelligent cabin in an automobile, or an intelligent cabin in other vehicles such as a helicopter. A multi-modal interaction method, system and apparatus for an intelligent cabin, and an intelligent cabin having the same, according to embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to an embodiment of the present application. It should be noted that the multi-modal interaction method for an intelligent cabin according to the embodiment of the present application can be applied to the multi-modal interaction apparatus for an intelligent cabin according to the embodiment of the present application, and the multi-modal interaction apparatus can be configured on an electronic device. The execution subject of the multi-modal interaction method for the intelligent cabin in this embodiment may be a master control chip.
As shown in fig. 1, the multi-modal interaction method for the intelligent cabin may include the following steps:
and step 101, acquiring cabin state information of the intelligent cabin.
Optionally, the cabin state information may be detected by the car machine system in the intelligent cabin, or may be detected by other means, such as sensors.
As one example, the cabin status information may include, but is not limited to, any one or more of cabin status information such as an in-vehicle temperature, a door closed condition, a seat belt wearing condition, a vehicle driving condition, and the like.
And 102, mapping the cabin state information into a corresponding identification text.
Optionally, the cabin state information of the intelligent cabin may be represented by a code, such as a digital code or a state code, so as to facilitate recognition by the vehicle-mounted system. Although the vehicle-mounted system can recognize such codes, the master control chip usually cannot recognize them directly. Therefore, a mapping relationship between the cabin state information of the intelligent cabin and the corresponding identification text may be established and stored in advance, so that the identification text corresponding to the cabin state information can be found according to the mapping relationship.
As an example, if the state code of the cabin state information of the intelligent cabin is "M1000", the cabin state information may be mapped to the corresponding identification text "door not closed" according to the mapping relationship stored in advance. For another example, if the state code is "M1111", the cabin state information may be mapped to the corresponding identification text "no abnormal cabin state and no destination set". It should be noted that the state codes "M1000" and "M1111" are given only as examples to help those skilled in the art understand that the state codes and identification texts described in this embodiment have a mapping relationship, and should not be taken as a specific limitation of the present application; that is, other state codes may also correspond to identification texts.
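By way of illustration, the state-code-to-identification-text mapping described above can be sketched as a simple lookup table. The following Python fragment is a minimal, hypothetical sketch: the two codes follow the examples in this paragraph, while the dictionary and function names are illustrative rather than part of any disclosed implementation.

# Hypothetical sketch of the pre-stored mapping between cabin state codes
# and identification texts (codes taken from the examples above).
STATE_CODE_TO_IDENTIFICATION_TEXT = {
    "M1000": "door not closed",
    "M1111": "no abnormal cabin state and no destination set",
}

def map_cabin_state_to_text(state_code):
    # Return the identification text for a known state code, or None if the
    # code has no stored mapping.
    return STATE_CODE_TO_IDENTIFICATION_TEXT.get(state_code)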
And 103, acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text.
As an example, the identification text may be sent to a cloud server, and the dialogue text, dialogue speech, face data of the virtual life image and action data corresponding to the identification text may be acquired. For example, if the identification text is "door not closed", the corresponding dialogue text "The door is not closed, please close the door." is obtained according to the identification text, together with the dialogue speech, face data of the virtual life image and action data corresponding to the dialogue text. For another example, if the identification text is "no abnormal cabin state and no destination set", the corresponding dialogue text "You have not set a destination yet, where is your destination?" is obtained according to the identification text, together with the corresponding dialogue speech, face data of the virtual life image and action data.
In the embodiment of the application, the virtual life image may be a virtual human, or may be a cartoon character or the like. In one implementation, the face data may include, but is not limited to, any one or more of facial expressions, eye data, lip data (e.g., lip opening and closing data) and the like; the action data may include limb action data and the like.
And 104, generating a rendering picture of the target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on the first display device, and displaying the dialogue text on the second display device.
As an example, if the dialogue speech, face data of the virtual life image and action data correspond to the dialogue text "You have not set a destination yet, where is your destination?", a rendering picture of the target virtual life image can be generated according to the dialogue speech, the face data and the action data of the virtual life image; the rendering picture of the target virtual life image is played on the first display device, and the dialogue text is displayed on the second display device. That is, the first display device plays the rendering picture of the target virtual life image, in which the target virtual life image makes the facial expressions, limb actions and the like corresponding to "You have not set a destination yet, where is your destination?" while the corresponding dialogue speech is played; the second display device displays the dialogue text "You have not set a destination yet, where is your destination?". Therefore, according to the cabin state information of the intelligent cabin, active guidance can be provided for the user by playing the rendering picture of the target virtual life image on the first display device and displaying the dialogue text on the second display device, so as to remind the user of the current cabin state.
It should be noted that the first display device and the second display device may be two display devices respectively controlled by two chips, or may be different areas, controlled by the two chips, on the same display device.
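A rough sketch of step 104 is given below. The renderer and display objects are placeholders introduced only for illustration; the application does not specify a rendering API, so the sketch shows the data flow rather than a concrete implementation.

# Hypothetical sketch of step 104: drive an avatar renderer with the dialogue
# speech, face data and action data, then route the outputs to the two displays.
def present_response(dialogue_text, dialogue_speech, face_data, action_data,
                     renderer, first_display, second_display):
    # Generate the rendering picture of the target virtual life image
    # (expressions, lip movements and limb actions synchronized with the speech).
    rendering_picture = renderer.render(face=face_data,
                                        action=action_data,
                                        speech=dialogue_speech)
    first_display.play(rendering_picture)       # rendered virtual life image
    second_display.show_text(dialogue_text)     # guidance text for the user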
Optionally, in some embodiments of the present application, corresponding cabin state semantic information may be obtained according to the identification text, a corresponding vehicle control instruction is generated according to the cabin state semantic information, and the vehicle control instruction is sent to the car machine system, wherein the vehicle control instruction is used for instructing the car machine system to perform corresponding control on the intelligent cabin. As an example, if the identification text corresponding to the cabin state information of the intelligent cabin is "no abnormal cabin state, the vehicle is in a driving state, and the doors are not locked", the corresponding cabin state semantic information may be acquired as "lock the vehicle"; a corresponding vehicle control instruction, i.e. a "lock the vehicle" instruction, is generated according to the cabin state semantic information, and the vehicle control instruction is sent to the car machine system, which then performs lock control on the cabin.
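The optional vehicle-control path can be sketched as follows; this is an assumption-laden illustration in which the matching rule, the instruction format and the send_to_car_machine callback are hypothetical stand-ins for the car machine interface rather than a disclosed design.

# Hypothetical sketch: derive cabin state semantic information from the
# identification text, turn it into a vehicle control instruction, and forward
# the instruction to the car machine system.
def handle_cabin_state(identification_text, send_to_car_machine):
    if ("driving state" in identification_text
            and "not locked" in identification_text):
        cabin_state_semantics = "lock the vehicle"
        instruction = {"action": "lock_vehicle", "reason": cabin_state_semantics}
        send_to_car_machine(instruction)   # car machine system locks the cabin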
According to the multi-modal interaction method for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image, so that the human-machine interaction process is closer to human-to-human interaction and is more natural and vivid.
Fig. 2 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to a second embodiment of the present application. As shown in fig. 2, the multi-modal interaction method for the intelligent cabin may include the following steps:
step 201, cabin state information of the intelligent cabin is acquired.
Step 202, the cabin status information is mapped to a corresponding identification text.
Step 203, sending the identification text to a cloud server.
Step 204, acquiring the corresponding dialogue text, dialogue speech, face data of the virtual life image and action data from the cloud server.
It should be noted that a large number of dialogue texts, dialogue speeches, face data of virtual life images and action data corresponding to identification texts are stored in the cloud server.
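As an illustration of steps 203 and 204, the exchange with the cloud server might look like the sketch below. The HTTP endpoint, field names and response layout are assumptions made for the example; the application does not define a wire format.

# Hypothetical sketch of fetching dialogue assets for an identification text
# from the cloud server over HTTP.
import requests

def fetch_dialogue_assets(identification_text, endpoint):
    response = requests.post(endpoint,
                             json={"identification_text": identification_text},
                             timeout=5)
    response.raise_for_status()
    data = response.json()
    # Assumed response keys: the dialogue text, the synthesized dialogue speech,
    # and the face data and action data used to render the virtual life image.
    return (data["dialogue_text"], data["dialogue_speech"],
            data["face_data"], data["action_data"])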
And step 205, generating a rendering picture of the target virtual life image according to the dialogue speech, the face data of the virtual life image and the action data, playing the rendering picture of the target virtual life image on the first display device, and displaying the dialogue text on the second display device.
In the embodiment of the present application, step 201, step 202 and step 205 may be implemented in any one of the manners described in the embodiments of the present application, which is not specifically limited in the present application and is not described again here.
According to the multi-modal interaction method for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, and the human-machine interaction process is closer to human-to-human interaction and is more natural and vivid.
Fig. 3 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to a third embodiment of the present application. As shown in fig. 3, the multi-modal interaction method for the intelligent cabin may include the following steps:
and 301, acquiring cabin state information of the intelligent cabin.
Step 302, the cabin status information is mapped to a corresponding identification text.
Step 303, acquiring the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text.
And step 304, generating a rendering picture of the target virtual life image according to the dialogue speech, the face data of the virtual life image and the action data, playing the rendering picture of the target virtual life image on the first display device, and displaying the dialogue text on the second display device.
And 305, setting the intelligent cabin to be in a full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device.
It should be noted that, when the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to a full-duplex voice interaction state, in which the intelligent cabin continuously listens to user instructions without requiring a wake-up word or other active guidance to trigger the multi-modal interaction flow, i.e., wake-up-free continuous dialogue. Subsequently, the collected voice information is sent to the cloud server, the cloud server gives a confidence result, and whether the voice information is valid voice information, sent by a user who needs to interact with the intelligent cabin, is judged according to the confidence result. If the voice information is invalid, such as collected noise or chat among users in the cabin, no response needs to be made to it.
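The wake-up-free listening behaviour described above can be pictured as the following loop. All callbacks and the confidence threshold are hypothetical; the application states only that the cloud server returns a confidence result used to decide whether the speech is a valid instruction.

# Hypothetical sketch of the full-duplex listening loop: audio is captured
# continuously, scored by the cloud server, and only speech judged to be a
# valid user instruction is handled.
CONFIDENCE_THRESHOLD = 0.5   # assumed value; not specified by the application

def full_duplex_loop(capture_audio, score_confidence, handle_instruction):
    while True:
        audio = capture_audio()               # microphone frames from the cabin
        confidence = score_confidence(audio)  # confidence result from the cloud
        if confidence >= CONFIDENCE_THRESHOLD:
            handle_instruction(audio)         # valid speech directed at the cabin
        # otherwise: noise or in-cabin chat, no response is made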
And step 306, acquiring first voice information acquired by a microphone on the intelligent cabin.
That is, according to the cabin state information of the intelligent cabin, after the rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, the voice information of the user needs to be collected for human-machine interaction. For example, the first display device plays a rendering picture of the target virtual life image, in which the target virtual life image says "It is detected that the current temperature is too high. Turn on the air conditioner?" with the corresponding facial expressions, limb actions and the like, and the second display device displays the dialogue text "It is detected that the current temperature is too high. Turn on the air conditioner?". The user says "Turn on the air conditioner." The microphone on the intelligent cabin collects "turn on the air conditioner" and takes it as the first voice information.
Step 307, performing voice recognition on the first voice information to obtain the corresponding first text information.
Optionally, a voice recognition technology may be adopted to perform voice recognition on the first voice information to obtain the text information corresponding to the first voice information. As an example, a voice recognition module is embedded in the master control chip, and the voice recognition module performs voice recognition on the first voice information.
Step 308, acquiring the first semantic information, first dialogue speech, first face data of the virtual life image and first action data corresponding to the first text information from the cloud server.
Optionally, the correspondence between text information and semantic information may be stored in the cloud server in advance, so that the master control chip can find the first semantic information corresponding to the first text information from the cloud server. For example, for the text information "turn on the air conditioner", the corresponding semantic information is "OK, turning it on for you."
As an example, assuming that the first voice information collected by the microphone on the intelligent cabin is "turn on the air conditioner" and the corresponding first text information is "turn on the air conditioner", the first semantic information "OK, turning it on for you." corresponding to the first text information is obtained from the cloud server, together with the corresponding first dialogue speech, first face data of the virtual life image and first action data.
Step 309, generating a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data of the virtual life image and the first action data, and playing the first rendering picture of the target virtual life image on the first display device.
And 310, generating a corresponding first vehicle control instruction according to the first text information and the first semantic information.
As an example, let the first text information be "turn on the air conditioner" and the first semantic information be "OK, turning it on for you." A corresponding first vehicle control instruction, i.e. a "turn on the air conditioner" instruction, is generated according to the first text information and the first semantic information.
Step 311, sending the first vehicle control instruction to the car machine system; the first vehicle control instruction is used for instructing the car machine system to execute the corresponding operation.
As an example, the first vehicle control instruction is a "turn on the air conditioner" instruction; the first vehicle control instruction is sent to the car machine system, and the car machine system performs the corresponding operation, for example, turning on the air conditioner in the intelligent cabin.
In the embodiment of the present application, steps 301 to 304 may be implemented by any one of the methods in the embodiments of the present application, and the present application is not specifically limited and is not described in detail.
According to the multi-modal interaction method for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. When the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to the full-duplex voice interaction state. The first voice information is collected through the microphone, the corresponding first semantic information, first dialogue speech, first face data of the virtual life image and first action data are acquired from the cloud server according to the first voice information, and a first rendering picture of the target virtual life image is generated. The first rendering picture of the target virtual life image is played on the first display device so as to respond to the first voice information. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, the interaction flow supports wake-up-free continuous dialogue, and the human-machine interaction process is closer to human-to-human interaction, more natural and vivid, which improves the user interaction experience.
It should be noted that, in addition to triggering the interaction flow of the intelligent cabin when active guidance needs to be given to the user according to the cabin state information, the interaction flow of the intelligent cabin can also be triggered by a preset wake-up word. Fig. 4 is a schematic flowchart of a multi-modal interaction method for an intelligent cabin according to the fourth embodiment of the present application. As shown in fig. 4, on the basis of the above embodiments, the multi-modal interaction method for the intelligent cabin may further include the following steps:
step 401, acquiring second voice information acquired by a microphone on the intelligent cabin, and performing voice recognition on the second voice information to acquire corresponding second text information.
And 402, responding to the second text message containing the preset awakening word, setting the intelligent cabin to be in a full-duplex voice interaction state, and triggering a multi-mode interaction process of the intelligent cabin.
That is to say, not only the interactive flow of the intelligent cabin is triggered when active guidance needs to be given to the user according to the state information of the cabin, but also the interactive flow of the intelligent cabin can be triggered in a mode of inputting preset awakening words through voice. As an example, assuming that the wake-up word is "hello", when the user needs to turn on music, the user inputs "hello, turn on music" by voice, and the second voice information "hello, turn on music" collected by the microphone on the smart car is subjected to voice recognition to obtain the corresponding second text information "hello, turn on music". And the second text message contains a preset awakening word, namely, the intelligent cabin is set to be in a full-duplex voice interaction state, and a multi-mode interaction flow of the intelligent cabin is triggered. And sending the second text information to a cloud server, and acquiring the dialect text' good corresponding to the identification text from the cloud server, wherein music is being opened for you. And corresponding conversational speech, facial data and motion data of the virtual life image, and generating a rendered picture of the target virtual life image, and playing the rendered picture of the target virtual life image on the first display device. And generating a corresponding vehicle control instruction 'open music' according to the second text information, and sending the vehicle control instruction to the vehicle-mounted machine system, so that the vehicle-mounted machine system completes the operation of 'open music'.
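A minimal sketch of the wake-word path follows, using the "hello" example above; the cabin object and its methods are placeholders for the behaviour described in steps 401 and 402, not a disclosed interface.

# Hypothetical sketch: recognized text containing the preset wake-up word
# switches the cabin into full-duplex interaction and triggers the
# multi-modal interaction flow.
WAKE_UP_WORD = "hello"

def on_recognized_text(second_text_information, cabin):
    if WAKE_UP_WORD in second_text_information:
        cabin.set_full_duplex(True)
        cabin.trigger_multimodal_flow(second_text_information)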
According to the multi-modal interaction method for the intelligent cabin of the embodiment of the application, in addition to triggering the interaction flow of the intelligent cabin when active guidance needs to be given to the user according to the cabin state information, the interaction flow of the intelligent cabin can also be triggered by inputting the preset wake-up word by voice.
Fig. 5 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to an embodiment of the present application. As shown in fig. 5, the multi-modal interaction system for the intelligent cabin may include a master control chip 501, a slave chip 502 and a local area network communication module 503.
Specifically, the master control chip 501 is configured to receive the cabin state information of the intelligent cabin transparently transmitted by the slave chip, map the cabin state information into a corresponding identification text, and acquire the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text.
Optionally, the master control chip 501 may send the identification text to the cloud server to acquire the dialogue text, dialogue speech, face data of the virtual life image and action data corresponding to the identification text. Alternatively, the master control chip 501 may acquire the dialogue text, dialogue speech, face data of the virtual life image and action data corresponding to the identification text from its own cached data.
The master control chip 501 is further configured to generate a rendering picture of the target virtual life image according to the dialogue speech, the face data of the virtual life image and the action data, control the first display device to play the rendering picture of the target virtual life image, and send the dialogue text to the slave chip.
The master control chip 501 may carry a full-duplex multi-modal interaction system, which may refer to fig. 6; fig. 6 is a schematic diagram of a full-duplex multi-modal interaction system provided according to a sixth embodiment of the present application. As shown in fig. 6, the full-duplex multi-modal interaction system is responsible for in-vehicle voice wake-up, voice recognition, voice broadcast, virtual human actions, acquisition of NLP (Natural Language Processing) resources (such as weather, audio and video, alarm clock, navigation and the like) and so on, so that voice and intelligent interaction modes are integrated to form a virtual interaction system closer to human-to-human interaction.
The slave chip 502 is configured to receive the dialogue text sent by the master control chip and control the second display device to display the dialogue text.
The slave chip 502 carries a video and navigation playing system. The video and navigation playing system is responsible for video playing, real-time map navigation, audio and video call display and the like on the second display device in the vehicle.
The local area network communication module 503 is configured to be responsible for communication among the master control chip, the slave chip and the car machine system of the intelligent cabin.
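To illustrate the role of the local area network communication module, the sketch below shows one possible way messages could be passed between the chips over plain TCP; the address, port and JSON payload are assumptions, as the application does not name a concrete transport or protocol.

# Hypothetical sketch of passing a message over the in-cabin local area network,
# e.g. the slave chip transparently forwarding cabin state to the master chip.
import json
import socket

def send_lan_message(host, port, payload):
    with socket.create_connection((host, port), timeout=2) as connection:
        connection.sendall(json.dumps(payload).encode("utf-8"))

# Example use (address assumed): forward a cabin state code to the master chip.
# send_lan_message("192.168.1.10", 9000, {"type": "cabin_state", "code": "M1000"})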
In some embodiments of the present application, the master control chip 501 is further configured to obtain corresponding cabin state semantic information according to the identification text, generate a corresponding vehicle control instruction according to the cabin state semantic information, and transparently transmit the vehicle control instruction to the car machine system through the slave chip, wherein the vehicle control instruction is used for instructing the car machine system to perform corresponding control on the intelligent cabin.
In some embodiments of the present application, the master control chip 501 is further configured to set the intelligent cabin to the full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device, acquire the first voice information collected by the microphone on the intelligent cabin, perform voice recognition on the first voice information to obtain the corresponding first text information, acquire the first semantic information, first dialogue speech, first face data of the virtual life image and first action data corresponding to the first text information, generate a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data of the virtual life image and the first action data, and control the first display device to play the first rendering picture of the target virtual life image. The master control chip 501 is further configured to generate a corresponding first vehicle control instruction according to the first text information and the first semantic information, and transparently transmit the first vehicle control instruction to the car machine system through the slave chip, wherein the first vehicle control instruction is used for instructing the car machine system to execute the corresponding operation.
In some embodiments of the present application, the master control chip 501 is further configured to acquire second voice information collected by the microphone on the intelligent cabin, and perform voice recognition on the second voice information to obtain the corresponding second text information; and, in response to the second text information containing the preset wake-up word, set the intelligent cabin to the full-duplex voice interaction state and trigger the multi-modal interaction flow of the intelligent cabin.
In some embodiments of the present application, the master control chip 501 is further configured to obtain natural language processing resource information based on the user's voice information and send the natural language processing resource information to the slave chip. The slave chip 502 is further configured to receive the natural language processing resource information sent by the master control chip and control the second display device to display the natural language processing resource information.
According to the multi-modal interaction system for the intelligent cabin of the embodiment of the application, the master control chip can obtain the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image according to the cabin state information of the intelligent cabin, and generate a rendering picture of the target virtual life image. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. When the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to the full-duplex voice interaction state. The first voice information is collected through the microphone, the corresponding first semantic information, first dialogue speech, first face data and first action data of the virtual life image are acquired according to the first voice information, and a first rendering picture of the target virtual life image is generated. The first rendering picture of the target virtual life image is played on the first display device so as to respond to the first voice information. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, the interaction flow supports wake-up-free continuous dialogue, and the human-machine interaction process is closer to human-to-human interaction, more natural and vivid, which improves the user interaction experience.
Fig. 7 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to a seventh embodiment of the present application. As shown in fig. 7, on the basis of the above embodiments, the multi-modal interaction system for the intelligent cabin may further include a wireless communication module 704.
Specifically, the wireless communication module 704 is responsible for network communication inside and outside the intelligent cabin. The slave chip 702 is connected to the second display device and is further configured to perform audio and video call display with external devices through the wireless communication module 704.
Wherein, 701-703 in fig. 7 and 501-503 in fig. 5 have the same functions and structures.
For a better understanding of the multi-modal interaction system for the intelligent cabin provided in the embodiments of the present application, a detailed description is given below with reference to fig. 8. Fig. 8 is a schematic diagram of a multi-modal interaction system for an intelligent cabin according to an eighth embodiment of the present application. As shown in fig. 8, the cabin state information (e.g., a passenger is seated) is sent to the slave chip through the local area network communication module, and the slave chip sends the cabin state information to the master control chip through the local area network communication module. The master control chip maps the cabin state information into a corresponding identification text, and obtains audio and video resources, such as the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image, according to the identification text. The master control chip generates a rendering picture of the virtual life image according to the dialogue speech, the face data of the virtual life image and the action data, and controls the first display device to play it. The master control chip sends the dialogue text to the slave chip, and the slave chip controls the second display device to display the dialogue text. For the audio and video resources obtained by the master control chip (such as video resources, map navigation, video calls and the like), the master control chip sends them to the slave chip, and the slave chip controls the second display device to display them. For the first vehicle control instruction generated according to the first text information and the first semantic information, the master control chip transparently transmits it to the car machine system through the slave chip, and the car machine system is controlled to execute the corresponding operation, such as in-vehicle environment adjustment, trip settings and the like.
According to the multi-modal interaction system for the intelligent cabin of the embodiment of the application, the master control chip can obtain the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image according to the cabin state information of the intelligent cabin, and generate a rendering picture of the target virtual life image. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. When the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to the full-duplex voice interaction state. The first voice information is collected through the microphone, the corresponding first semantic information, first dialogue speech, first face data and first action data of the virtual life image are acquired according to the first voice information, and a first rendering picture of the target virtual life image is generated. The first rendering picture of the target virtual life image is played on the first display device so as to respond to the first voice information. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, the interaction flow supports wake-up-free continuous dialogue, and the human-machine interaction process is closer to human-to-human interaction, more natural and vivid, which improves the user interaction experience.
Fig. 9 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to a ninth embodiment of the present application. As shown in fig. 9, the multi-modal interaction apparatus for the intelligent cabin may include a first obtaining module 901, a mapping module 902, a second obtaining module 903 and a control module 904.
Specifically, the first obtaining module 901 is configured to acquire the cabin state information of the intelligent cabin.
The mapping module 902 is configured to map the cabin state information into a corresponding identification text.
The second obtaining module 903 is configured to acquire the corresponding dialogue text, dialogue speech, and face data and action data of the virtual life image according to the identification text.
The control module 904 is configured to generate a rendering picture of the target virtual life image according to the dialogue speech, the face data of the virtual life image and the action data, play the rendering picture of the target virtual life image on the first display device, and display the dialogue text on the second display device.
In some embodiments of the present application, the second obtaining module 903 is specifically configured to: send the identification text to the cloud server; and acquire the dialogue text, dialogue speech, face data of the virtual life image and action data corresponding to the identification text from the cloud server.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the multi-modal interaction apparatus for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, and the human-machine interaction process is closer to human-to-human interaction and is more natural and vivid.
Fig. 10 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to an embodiment of the present application. As shown in fig. 10, on the basis of the above embodiment, the multi-modal interaction apparatus for the intelligent cabin may further include a third obtaining module 1005, a first generating module 1006 and a first sending module 1007.
Specifically, the third obtaining module 1005 is configured to obtain, according to the identification text, corresponding cabin status semantic information.
And the first generating module 1006 is configured to generate a corresponding vehicle control instruction according to the cabin state semantic information.
The first sending module 1007 is configured to send the vehicle control instruction to the car machine system; the vehicle control instruction is used for instructing the car machine system to perform corresponding control on the intelligent cabin.
Wherein, 1001-1004 in fig. 10 and 901-904 in fig. 9 have the same functions and structures.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the multi-modal interaction apparatus for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. A corresponding vehicle control instruction is generated according to the cabin state information of the intelligent cabin and sent to the car machine system for the related operation. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, and the human-machine interaction process is closer to human-to-human interaction and is more natural and vivid.
Fig. 11 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to an eleventh embodiment of the present application. As shown in fig. 11, on the basis of the above embodiment, the multi-modal interaction apparatus for the intelligent cabin may further include a setting module 1108, a fourth obtaining module 1109, a voice recognition module 1110, a fifth obtaining module 1111, a second generating module 1112 and a second sending module 1113.
Specifically, the setting module 1108 is configured to set the intelligent cabin to the full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device.
The fourth obtaining module 1109 is configured to acquire the first voice information collected by the microphone on the intelligent cabin.
The voice recognition module 1110 is configured to perform voice recognition on the first voice information to obtain the corresponding first text information.
The fifth obtaining module 1111 is configured to acquire, from the cloud server, the first semantic information, first dialogue speech, first face data of the virtual life image and first action data corresponding to the first text information.
The control module 1104 is further configured to generate a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data of the virtual life image and the first action data, and play the first rendering picture of the target virtual life image on the first display device.
The second generating module 1112 is configured to generate a corresponding first vehicle control instruction according to the first text information and the first semantic information.
The second sending module 1113 is configured to send the first vehicle control instruction to the car machine system, where the first vehicle control instruction is used to instruct the car machine system to execute the corresponding operation.
Wherein, 1101-1107 in fig. 11 and 1001-1007 in fig. 10 have the same functions and structures.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the multi-modal interaction apparatus for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. When the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to the full-duplex voice interaction state. The first voice information is collected through the microphone, the corresponding first semantic information, first dialogue speech, first face data of the virtual life image and first action data are acquired from the cloud server according to the first voice information, and a first rendering picture of the target virtual life image is generated. The first rendering picture of the target virtual life image is played on the first display device so as to respond to the first voice information. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, the interaction flow supports wake-up-free continuous dialogue, and the human-machine interaction process is closer to human-to-human interaction, more natural and vivid, which improves the user interaction experience.
Fig. 12 is a block diagram of a multi-modal interaction apparatus for an intelligent cabin according to a twelfth embodiment of the present application. As shown in fig. 12, on the basis of the above embodiment, the multi-modal interaction apparatus for the intelligent cabin may further include a sixth obtaining module 1214 and a triggering module 1215.
Specifically, the sixth obtaining module 1214 is configured to acquire the second voice information collected by the microphone on the intelligent cabin, and perform voice recognition on the second voice information to obtain the corresponding second text information;
and the triggering module 1215 is configured to, in response to the second text information containing the preset wake-up word, set the intelligent cabin to the full-duplex voice interaction state and trigger the multi-modal interaction flow of the intelligent cabin.
Wherein 1201-1213 in fig. 12 and 1101-1113 in fig. 11 have the same functions and structures.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the multi-modal interaction apparatus for the intelligent cabin of the embodiment of the application, the corresponding dialogue text, dialogue speech, face data and action data of the virtual life image are obtained from the cloud server according to the cabin state information of the intelligent cabin, and a rendering picture of the target virtual life image is generated. The rendering picture of the target virtual life image is played on the first display device and the dialogue text is displayed on the second display device, so as to actively guide the user and remind the user of the current cabin state. When the rendering picture of the target virtual life image is played on the first display device, the intelligent cabin is set to the full-duplex voice interaction state. The first voice information is collected through the microphone, the corresponding first semantic information, first dialogue speech, first face data of the virtual life image and first action data are acquired from the cloud server according to the first voice information, and a first rendering picture of the target virtual life image is generated. The first rendering picture of the target virtual life image is played on the first display device so as to respond to the first voice information. Multi-modal human-machine interaction is realized through the expressions, actions, voice and text of the target virtual life image; the instructions are richer, the interaction flow supports wake-up-free continuous dialogue, and the human-machine interaction process is closer to human-to-human interaction, more natural and vivid, which improves the user interaction experience.
According to an embodiment of the present application, an intelligent cabin is further provided, which comprises the multi-modal interaction system according to any one of the embodiments shown in fig. 5 and fig. 7.
There is also provided, in accordance with an embodiment of the present application, an intelligent cabin, a readable storage medium, and a computer program product.
Fig. 13 is a schematic diagram of an intelligent cabin for implementing the multi-modal interaction method according to an embodiment of the present application. The intelligent cabin can be a cabin in an automobile or a cabin in another vehicle such as a helicopter. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application described and/or claimed herein.
As shown in fig. 13, the intelligent cabin comprises: one or more processors 1301, a memory 1302, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the intelligent cabin, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 13 takes one processor 1301 as an example.
Memory 1302 is a non-transitory computer readable storage medium as provided herein. Wherein the memory stores instructions executable by at least one processor to cause the at least one processor to perform the multimodal interaction method provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the multimodal interaction method provided herein.
The memory 1302, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the multi-modal interaction method in the embodiments of the present application (for example, the first obtaining module 1201, the mapping module 1202, the second obtaining module 1203, the control module 1204, the third obtaining module 1205, the first generating module 1206, the first sending module 1207, the setting module 1208, the fourth obtaining module 1209, the voice recognition module 1210, the fifth obtaining module 1211, the second generating module 1212, the second sending module 1213, the sixth obtaining module 1214, and the triggering module 1215 shown in fig. 12). The processor 1301 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 1302, that is, implements the multimodal interaction method in the above method embodiments.
The memory 1302 may include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created during use of the intelligent cabin implementing the multi-modal interaction method, and the like. Further, the memory 1302 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or another non-transitory solid-state storage device. In some embodiments, the memory 1302 may optionally include memory located remotely from the processor 1301, which may be connected over a network to the intelligent cabin implementing the multi-modal interaction method. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The intelligent cabin implementing the multi-modal interaction method may further include: an input device 1303 and an output device 1304. The processor 1301, the memory 1302, the input device 1303 and the output device 1304 may be connected by a bus or in other manners; fig. 13 takes connection by a bus as an example.
The input device 1303 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the intelligent cabin implementing the multi-modal interaction method, and may be, for example, a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, or the like. The output device 1304 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibration motors), and the like. The display device may include, but is not limited to, a liquid crystal display (LCD), a light-emitting diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that, when executed by a processor, implement the multi-modal interaction method described in the above embodiments; the one or more computer programs are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain. It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, which is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (20)

1. A multi-modal interaction method for an intelligent cabin, comprising:
acquiring cabin state information of the intelligent cabin;
mapping the cabin state information into corresponding identification texts;
acquiring corresponding dialogue text, dialogue speech, face data and action data of a virtual life image according to the identification text;
and generating a rendering picture of a target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on a first display device, and displaying the dialogue text on a second display device.
2. The method of claim 1, further comprising:
acquiring corresponding cabin state semantic information according to the identification text;
generating a corresponding vehicle control instruction according to the cabin state semantic information;
and sending the vehicle control instruction to a vehicle machine system; wherein the vehicle control instruction is used for instructing the vehicle machine system to correspondingly control the intelligent cabin.
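Purely as an illustration of the flow recited in claim 2, the following Python sketch maps cabin state semantic information to a vehicle control instruction and forwards it to the vehicle machine system; the semantic labels, instruction payloads and transport call are invented for the example and are not the patented implementation.

```python
# Assumed mapping from cabin state semantic information to a control instruction.
SEMANTIC_TO_INSTRUCTION = {
    "cabin_temperature_high": {"device": "air_conditioner", "command": "on", "target_c": 24},
    "window_open_while_raining": {"device": "window", "command": "close"},
}

def build_control_instruction(semantic_info: str) -> dict:
    # Generate the vehicle control instruction for the given semantic information.
    return SEMANTIC_TO_INSTRUCTION.get(semantic_info, {"device": "none", "command": "noop"})

def send_to_vehicle_machine_system(instruction: dict) -> None:
    # Placeholder for the bus/IPC call that delivers the instruction.
    print("vehicle machine system received:", instruction)

send_to_vehicle_machine_system(build_control_instruction("cabin_temperature_high"))
```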
3. The method of claim 1, wherein the acquiring corresponding dialogue text, dialogue speech, face data and action data of the virtual life image according to the identification text comprises:
sending the identification text to a cloud server;
and acquiring the dialogue text, the dialogue speech, the face data and the action data of the virtual life image corresponding to the identification text from the cloud server.
4. The method of claim 1, further comprising:
setting the intelligent cabin to be in a full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device;
acquiring first voice information collected by a microphone in the intelligent cabin;
performing voice recognition on the first voice information to obtain corresponding first text information;
acquiring first semantic information, first dialogue speech, and first face data and first action data of the virtual life image corresponding to the first text information from a cloud server;
generating a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data and the first action data of the virtual life image, and playing the first rendering picture of the target virtual life image on the first display device;
generating a corresponding first vehicle control instruction according to the first text information and the first semantic information;
and sending the first vehicle control instruction to a vehicle machine system; wherein the first vehicle control instruction is used for instructing the vehicle machine system to execute a corresponding operation.
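A minimal, hypothetical sketch of the claim-4 flow in the full-duplex state follows: a single user utterance yields both an avatar response for the first display device and a first vehicle control instruction for the vehicle machine system. The ASR, the cloud lookup and the payloads are stand-ins assumed for illustration.

```python
def asr(first_voice_info: bytes) -> str:
    # Placeholder speech recognition returning the first text information.
    return "please turn on the seat heater"

def cloud_lookup(text: str) -> dict:
    # Placeholder cloud call returning first semantic information plus
    # dialogue speech, face data and action data of the virtual life image.
    return {
        "semantic": {"intent": "seat_heater_on"},
        "speech": b"<tts>", "face": b"<face>", "action": b"<action>",
    }

def handle_first_voice_info(audio: bytes) -> None:
    text = asr(audio)
    result = cloud_lookup(text)
    # Path 1: render and play the first rendering picture of the target virtual life image.
    print("display 1: avatar responds to:", text)
    # Path 2: derive the first vehicle control instruction and send it onward.
    instruction = {"intent": result["semantic"]["intent"], "utterance": text}
    print("vehicle machine system executes:", instruction)

handle_first_voice_info(b"<pcm>")
```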
5. The method of claim 1, further comprising:
acquiring second voice information collected by a microphone in the intelligent cabin, and performing voice recognition on the second voice information to acquire corresponding second text information;
and in response to the second text information containing a preset wake-up word, setting the intelligent cabin to be in a full-duplex voice interaction state, and triggering a multi-modal interaction flow of the intelligent cabin.
6. A multi-modal interaction system for an intelligent cabin, comprising: a master control chip, a slave chip and a local area network communication module, wherein,
the local area network communication module is used for communication among the master control chip, the slave chip and a vehicle machine system on the intelligent cabin;
the master control chip is used for receiving cabin state information of the intelligent cabin transmitted by the slave chip, mapping the cabin state information into a corresponding identification text, and acquiring corresponding dialogue text, dialogue speech, face data and action data of a virtual life image according to the identification text;
the master control chip is further used for generating a rendering picture of a target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, controlling a first display device to play the rendering picture of the target virtual life image, and sending the dialogue text to the slave chip;
and the slave chip is used for receiving the dialogue text sent by the master control chip and controlling a second display device to display the dialogue text.
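As a rough illustration of the claim-6 topology, the sketch below models the master control chip pushing the dialogue text to the slave chip over the in-cabin local area network, with the slave chip driving the second display device. The loopback address, port number and JSON message format are assumptions; a real deployment would use the cabin's own LAN addressing. The slave loop must be running before the master sends.

```python
import json
import socket

LAN_PORT = 9000  # assumed port for master-to-slave messages

def master_send_dialogue_text(text: str) -> None:
    # Master control chip: after rendering the avatar on the first display,
    # forward the dialogue text to the slave chip over the LAN.
    with socket.create_connection(("127.0.0.1", LAN_PORT)) as conn:
        conn.sendall(json.dumps({"type": "dialogue_text", "text": text}).encode())

def slave_receive_loop() -> None:
    # Slave chip: receive the dialogue text and show it on the second display.
    with socket.create_server(("127.0.0.1", LAN_PORT)) as server:
        conn, _ = server.accept()
        with conn:
            msg = json.loads(conn.recv(4096).decode())
            if msg["type"] == "dialogue_text":
                print("display 2:", msg["text"])
```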
7. The system of claim 6, wherein,
the master control chip is further used for acquiring corresponding cabin state semantic information according to the identification text, generating a corresponding vehicle control instruction according to the cabin state semantic information, and transmitting the vehicle control instruction to the vehicle machine system through the slave chip; wherein the vehicle control instruction is used for instructing the vehicle machine system to correspondingly control the intelligent cabin.
8. The system of claim 6, wherein,
the master control chip is further configured to set the intelligent cabin to a full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device, acquire first voice information collected by a microphone in the intelligent cabin, perform voice recognition on the first voice information to acquire corresponding first text information, acquire first semantic information, first dialogue speech, and first face data and first action data of the virtual life image corresponding to the first text information, generate a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data and the first action data of the virtual life image, and control the first display device to play the first rendering picture of the target virtual life image;
the master control chip is further configured to generate a corresponding first vehicle control instruction according to the first text information and the first semantic information, and transmit the first vehicle control instruction to the vehicle machine system through the slave chip; wherein the first vehicle control instruction is used for instructing the vehicle machine system to execute a corresponding operation.
9. The system of claim 6, wherein the master control chip is further configured to:
acquire second voice information collected by a microphone in the intelligent cabin, and perform voice recognition on the second voice information to acquire corresponding second text information;
and in response to the second text information containing a preset wake-up word, set the intelligent cabin to a full-duplex voice interaction state and trigger a multi-modal interaction flow of the intelligent cabin.
10. The system of claim 6, wherein,
the master control chip is also used for acquiring natural language processing resource information based on user voice information and sending the natural language processing resource information to the slave chip;
the slave chip is further configured to receive the natural language processing resource information sent by the master control chip, and control the second display device to display the natural language processing resource information.
11. The system of claim 6, further comprising:
a wireless communication module, used for communication between the intelligent cabin and a network outside the intelligent cabin;
wherein the slave chip is connected with the second display device and is further used for displaying audio and video calls with an external device through the wireless communication module.
12. A multi-modal interaction apparatus for an intelligent cabin, comprising:
the first acquisition module is used for acquiring the cabin state information of the intelligent cabin;
the mapping module is used for mapping the cabin state information into corresponding identification texts;
the second acquisition module is used for acquiring corresponding dialogue text, dialogue speech, face data and action data of a virtual life image according to the identification text;
and the control module is used for generating a rendering picture of a target virtual life image according to the dialogue speech, the face data and the action data of the virtual life image, playing the rendering picture of the target virtual life image on a first display device, and displaying the dialogue text on a second display device.
13. The apparatus of claim 12, further comprising:
the third acquisition module is used for acquiring corresponding cabin state semantic information according to the identification text;
the first generation module is used for generating a corresponding vehicle control instruction according to the cabin state semantic information;
the first sending module is used for sending the vehicle control instruction to a vehicle machine system; wherein the vehicle control instruction is used for instructing the vehicle machine system to correspondingly control the intelligent cabin.
14. The apparatus according to claim 12, wherein the second obtaining module is specifically configured to:
sending the identification text to a cloud server;
and acquiring the dialogue text, the dialogue speech, the face data and the action data of the virtual life image corresponding to the identification text from the cloud server.
15. The apparatus of claim 12, further comprising:
the setting module is used for setting the intelligent cabin to be in a full-duplex voice interaction state when the rendering picture of the target virtual life image is played on the first display device;
the fourth acquisition module is used for acquiring first voice information collected by a microphone in the intelligent cabin;
the voice recognition module is used for carrying out voice recognition on the first voice information to obtain corresponding first text information;
the fifth acquisition module is used for acquiring first semantic information, first dialogue speech, and first face data and first action data of the virtual life image corresponding to the first text information from the cloud server;
the control module is further configured to generate a first rendering picture of the target virtual life image according to the first dialogue speech, the first face data and the first action data of the virtual life image, and play the first rendering picture of the target virtual life image on the first display device;
the second generation module is used for generating a corresponding first vehicle control instruction according to the first text information and the first semantic information;
the second sending module is used for sending the first vehicle control instruction to the vehicle machine system; wherein the first vehicle control instruction is used for instructing the vehicle machine system to execute a corresponding operation.
16. The apparatus of claim 12, further comprising:
the sixth acquisition module is used for acquiring second voice information collected by a microphone in the intelligent cabin, and performing voice recognition on the second voice information to acquire corresponding second text information;
and the triggering module is used for setting the intelligent cabin to be in a full-duplex voice interaction state and triggering a multi-modal interaction flow of the intelligent cabin in response to the second text information containing a preset wake-up word.
17. An intelligent cabin, comprising:
the multi-modal interaction system as claimed in any one of claims 6 to 11.
18. An intelligent cabin, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 5.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 5.
20. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1 to 5.
CN202111428469.1A 2021-11-26 2021-11-26 Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system Active CN114327041B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111428469.1A CN114327041B (en) 2021-11-26 2021-11-26 Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111428469.1A CN114327041B (en) 2021-11-26 2021-11-26 Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system

Publications (2)

Publication Number Publication Date
CN114327041A true CN114327041A (en) 2022-04-12
CN114327041B CN114327041B (en) 2022-09-27

Family

ID=81047483

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111428469.1A Active CN114327041B (en) 2021-11-26 2021-11-26 Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system

Country Status (1)

Country Link
CN (1) CN114327041B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110087493A1 (en) * 2008-06-16 2011-04-14 Stefan Sellschopp Communication System and Method for Representing Information in a Communication
CN106648512A (en) * 2016-12-23 2017-05-10 广州汽车集团股份有限公司 Method and device for displaying vehicle-mounted virtual images and vehicle-mounted main machine
CN111210540A (en) * 2018-11-22 2020-05-29 上海擎感智能科技有限公司 Vehicle, vehicle machine equipment and human-computer interaction method thereof
US20200211553A1 (en) * 2018-12-28 2020-07-02 Harman International Industries, Incorporated Two-way in-vehicle virtual personal assistant
CN112162628A (en) * 2020-09-01 2021-01-01 魔珐(上海)信息科技有限公司 Multi-mode interaction method, device and system based on virtual role, storage medium and terminal
US20210343287A1 (en) * 2020-12-22 2021-11-04 Apollo Intelligent Connectivity (Beijing) Technology Co., Ltd. Voice processing method, apparatus, device and storage medium for vehicle-mounted device
CN113228167A (en) * 2021-03-22 2021-08-06 华为技术有限公司 Voice control method and device
CN113655938A (en) * 2021-08-17 2021-11-16 北京百度网讯科技有限公司 Interaction method, device, equipment and medium for intelligent cockpit

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114640474A (en) * 2022-05-19 2022-06-17 润芯微科技(江苏)有限公司 Safety authentication and encryption method for automobile separated type cabin
WO2023226913A1 (en) * 2022-05-23 2023-11-30 阿里巴巴(中国)有限公司 Virtual character drive method, apparatus, and device based on expression recognition
CN115243107A (en) * 2022-07-08 2022-10-25 华人运通(上海)云计算科技有限公司 Method, device, system, electronic equipment and medium for playing short video
CN115243107B (en) * 2022-07-08 2023-11-21 华人运通(上海)云计算科技有限公司 Method, device, system, electronic equipment and medium for playing short video
WO2024022027A1 (en) * 2022-07-26 2024-02-01 中国第一汽车股份有限公司 Intelligent voice interaction method and apparatus, device, and storage medium
CN116939616A (en) * 2023-09-15 2023-10-24 中关村科学城城市大脑股份有限公司 Equipment control method and device applied to telecommunication fraud prevention and electronic equipment
CN116939616B (en) * 2023-09-15 2023-12-08 中关村科学城城市大脑股份有限公司 Equipment control method and device applied to telecommunication fraud prevention and electronic equipment

Also Published As

Publication number Publication date
CN114327041B (en) 2022-09-27

Similar Documents

Publication Publication Date Title
CN114327041B (en) Multi-mode interaction method and system for intelligent cabin and intelligent cabin with multi-mode interaction method and system
CN109243432B (en) Voice processing method and electronic device supporting the same
US11145302B2 (en) System for processing user utterance and controlling method thereof
US11031005B2 (en) Continuous topic detection and adaption in audio environments
KR102596436B1 (en) System for processing user utterance and controlling method thereof
JP2019164345A (en) System for processing sound data, user terminal and method for controlling the system
CN109102802B (en) System for processing user utterances
CN108694944B (en) Method and apparatus for generating natural language expressions by using a framework
US11302325B2 (en) Automatic dialogue design
KR20180121758A (en) Electronic apparatus for processing user utterance and controlling method thereof
US11314548B2 (en) Electronic device and server for processing data received from electronic device
CN113655938B (en) Interaction method, device, equipment and medium for intelligent cockpit
EP3746907B1 (en) Dynamically evolving hybrid personalized artificial intelligence system
KR20190001865A (en) Method for controlling display and electronic device supporting the same
KR102508677B1 (en) System for processing user utterance and controlling method thereof
KR20190090424A (en) Method for responding user speech and electronic device supporting the same
KR20180118894A (en) electronic device providing speech recognition service and method thereof
US20200319841A1 (en) Agent apparatus, agent apparatus control method, and storage medium
KR102369309B1 (en) Electronic device for performing an operation for an user input after parital landing
CN112164395A (en) Vehicle-mounted voice starting method and device, electronic equipment and storage medium
KR20210055347A (en) An aritificial intelligence apparatus
JP7222757B2 (en) AGENT DEVICE, CONTROL METHOD OF AGENT DEVICE, AND PROGRAM
CN115273865A (en) Intelligent voice interaction method, device, equipment and storage medium
US11518398B2 (en) Agent system, agent server, method of controlling agent server, and storage medium
KR20190087129A (en) Method and Apparatus For Providing Conversational Service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant