CN115859219A - Multi-modal interaction method, device, equipment and storage medium

Info

Publication number
CN115859219A
CN115859219A
Authority
CN
China
Prior art keywords
vehicle
emotion
scene
user
modal
Prior art date
2022-12-22
Legal status
Pending
Application number
CN202211657075.8A
Other languages
Chinese (zh)
Inventor
朱思邈
郭尧
曾通
冷永才
赵晨旭
Current Assignee
Geely Automobile Research Institute Ningbo Co Ltd
Original Assignee
Geely Automobile Research Institute Ningbo Co Ltd
Priority date
2022-12-22
Filing date
2022-12-22
Publication date
2023-03-28
Application filed by Geely Automobile Research Institute Ningbo Co Ltd

Landscapes

  • User Interface Of Digital Computer (AREA)

Abstract

The application provides a multi-modal interaction method, apparatus, device and storage medium, and relates to the field of automotive technologies. The method generates multi-modal output by combining the user's emotion label with the current vehicle-mounted scene, thereby realizing multi-modal output for vehicle-mounted voice interaction and improving the user experience.

Description

Multi-modal interaction method, device, equipment and storage medium
Technical Field
The present application relates to the field of automotive technologies, and in particular, to a multi-modal interaction method, apparatus, device, and storage medium.
Background
In recent years, as the level of intelligence in the automobile industry keeps rising, development towards intelligent vehicles has become an inevitable trend. The intelligent cockpit is an important component of this development, and the new technologies, scenarios and modes it contains keep emerging. In-vehicle voice interaction is the most direct, human-friendly, safe and efficient intelligent feature of the current intelligent cockpit, and the expectations of drivers and passengers have grown beyond basic command control: they increasingly demand voice interaction that is personified, emotionally aware, rich and interesting.
With the development of Artificial Intelligence (AI) technology and the increasing hardware performance of in-vehicle head units, the evolution of in-vehicle voice interaction from the traditional single-modal interaction mode to a multi-modal interaction mode has become an inevitable trend. How to realize multi-modal interaction for in-vehicle voice is therefore an urgent problem to be solved.
Disclosure of Invention
The embodiments of the application provide a multi-modal interaction method, apparatus, device and storage medium, which are used to realize multi-modal interaction for vehicle-mounted voice.
In a first aspect, an embodiment of the present application provides a multi-modal interaction method, including: acquiring vehicle environment information and user behavior information of a target vehicle; determining emotion labels of the user according to the vehicle environment information and the user behavior information, wherein the emotion labels comprise positive labels and negative labels; based on a vehicle-mounted scene, generating multi-modal output according to the emotion labels, wherein the vehicle-mounted scene comprises a parking scene and a driving scene, and the multi-modal output corresponding to different emotion labels is different.
In one possible implementation, generating a multimodal output based on an emotional tag based on a vehicle scene includes: if the vehicle-mounted scene is a driving scene, generating first content corresponding to the emotion label through a text generation technology and an image generation technology, and outputting the first content by adopting a text mode and an image mode; or if the vehicle-mounted scene is a driving scene, generating second content corresponding to the emotion label through a text generation technology and an audio generation technology, and outputting the second content by adopting a text mode and an audio mode.
In one possible implementation, generating a multimodal output from emotion tags based on an on-board scene includes: if the vehicle-mounted scene is a parking scene, generating third content corresponding to the emotion label through a text generation technology and a video generation technology; and outputting the third content by adopting a text mode and a video mode.
In one possible implementation manner, the trigger condition for acquiring the vehicle environment information and the user behavior information of the target vehicle includes: acquiring a user instruction; and generating the multi-modal output according to the emotion labels based on the vehicle-mounted scene includes: generating the multi-modal output according to the emotion labels and the user instruction based on the vehicle-mounted scene.
In one possible implementation manner, determining the emotion label of the user according to the vehicle environment information and the user behavior information includes: respectively determining emotion labels corresponding to the vehicle environment information and the user behavior information; determining a first total number of positive labels and/or a second total number of negative labels among the emotion labels; in response to the first total number being greater than a first threshold, determining that the emotion label of the user is a positive label; and in response to the second total number being greater than a second threshold, determining that the emotion label of the user is a negative label.
In one possible implementation, the emotion labels further include a neutral label, and the multi-modal interaction method further includes: in response to the first total number being less than or equal to the first threshold and the second total number being less than or equal to the second threshold, determining that the emotion label of the user is a neutral label; generating fourth content through a text generation technology; and outputting the fourth content by adopting a text mode.
In one possible implementation, determining an emotion label of a user according to vehicle environment information and user behavior information includes: and inputting the vehicle environment information and the user behavior information into the language representation model to obtain the emotion label of the user output by the language representation model.
In one possible implementation, the vehicle environment information includes a vehicle speed and a gear, and before generating the multi-modal output according to the emotion tag based on the vehicle-mounted scene, the method further includes: when the vehicle speed is 0 and/or the gear is a neutral gear, determining that the vehicle-mounted scene is a parking scene; and when the vehicle speed is not 0 and the gear is not neutral, determining that the vehicle-mounted scene is a driving scene.
In a second aspect, an embodiment of the present application provides a multi-modal interaction apparatus, including: an acquisition module, configured to acquire vehicle environment information and user behavior information of a target vehicle; a first determination module, configured to determine emotion labels of a user according to the vehicle environment information and the user behavior information, where the emotion labels include positive labels and negative labels; and an output module, configured to generate multi-modal output according to the emotion labels based on a vehicle-mounted scene, where the vehicle-mounted scene includes a parking scene and a driving scene, and the multi-modal output corresponding to different emotion labels is different.
In a possible implementation manner, the output module is specifically configured to: if the vehicle-mounted scene is a driving scene, generating first content corresponding to the emotion label through a text generation technology and an image generation technology, and outputting the first content by adopting a text mode and an image mode; or if the vehicle-mounted scene is a driving scene, generating second content corresponding to the emotion label through a text generation technology and an audio generation technology, and outputting the second content by adopting a text mode and an audio mode.
In one possible implementation manner, the output module may be further configured to: if the vehicle-mounted scene is a parking scene, generating third content corresponding to the emotion label through a text generation technology and a video generation technology; and outputting the third content by adopting a text mode and a video mode.
In a possible implementation manner, the trigger condition for the acquisition module to acquire the vehicle environment information and the user behavior information of the target vehicle includes: acquiring a user instruction; and the output module is specifically configured to generate the multi-modal output according to the emotion labels and the user instruction based on the vehicle-mounted scene.
In a possible implementation manner, the first determination module is specifically configured to: respectively determine emotion labels corresponding to the vehicle environment information and the user behavior information; determine a first total number of positive labels and/or a second total number of negative labels among the emotion labels; in response to the first total number being greater than a first threshold, determine that the emotion label of the user is a positive label; and in response to the second total number being greater than a second threshold, determine that the emotion label of the user is a negative label.
In one possible implementation, the emotion tag further includes a neutral tag, and the first determining module is further configured to: in response to the first total number being less than or equal to a first threshold and the second total number being less than or equal to a second threshold, determining that the emotional tag of the user is a neutral tag; generating fourth content through a text generation technology; and outputting the fourth content by adopting a text mode.
In one possible implementation manner, the first determining module may be further configured to: and inputting the vehicle environment information and the user behavior information into the language representation model to obtain the emotion label of the user output by the language representation model.
In one possible implementation manner, the vehicle environment information includes a vehicle speed and a gear, and the multimodal interaction apparatus further includes a second determination module, where the second determination module is configured to: when the vehicle speed is 0 and/or the gear is a neutral gear, determining that the vehicle-mounted scene is a parking scene; and when the vehicle speed is not 0 and the gear is not a neutral gear, determining that the vehicle-mounted scene is a driving scene.
In a third aspect, the present application provides an electronic device, comprising: at least one processor; and a memory coupled to the at least one processor; wherein the memory is configured to store instructions executable by the at least one processor to enable the at least one processor to perform the multi-modal interaction method provided by the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium having stored therein computer-executable instructions for implementing the multimodal interaction method provided in the first aspect when the computer-executable instructions are executed.
In a fifth aspect, the present application provides a program product comprising computer executable instructions. When executed by a computer, the instructions implement the multi-modal interaction method provided by the first aspect.
The application provides a multi-modal interaction method, apparatus, device and storage medium. In the application, multi-modal output is generated according to the emotion labels obtained from multi-modal inputs such as the vehicle environment information and the user behavior information, so that the multi-modal output is creative multi-modal content that matches the user's intention. Multi-modal output for vehicle-mounted voice interaction is thus realized, and the user experience is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart of a multimodal interaction method provided by an embodiment of the present application;
FIG. 2 is a block diagram of a multimodal output provided by an embodiment of the present application;
FIG. 3 is a flow diagram of a multimodal interaction method provided by another embodiment of the present application;
FIG. 4 is a schematic structural diagram of a multi-modal interaction apparatus provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
First, terms referred to in the embodiments of the present application will be explained.
Multi-modal interaction: a modality is a channel through which a living being receives information via its sense organs and experience; for example, humans have visual, auditory, olfactory, gustatory and tactile modalities, and multi-modality therefore means fusing multiple senses. Multi-modal interaction combines interaction technologies for multiple senses and carries out human-machine interaction through text, voice, vision, action, environment and other modes, fully simulating the way people interact with one another.
In the related art, the multi-modal interaction mode of in-vehicle voice interaction takes multi-modal information as input but uses only a text generation technology to generate and output content in a single text modality. The resulting output is monotonous, and the input user information cannot be combined with the on-board scene characteristics of the current vehicle to produce creative multi-modal output, so the user experience is poor.
To address these problems in the related art, the present application obtains real-time vehicle environment information and user behavior information, derives emotion labels that reflect the user's emotion and intention in the current environment, and generates multi-modal output according to the determined emotion labels based on the vehicle-mounted scene in which the vehicle is currently located, thereby realizing multi-modal output for vehicle-mounted voice interaction and improving the user experience.
For ease of understanding, an application scenario of the embodiment of the present application is first described.
The application scenario provided by the embodiments of the application includes a vehicle and a server, which are communicatively connected. In some embodiments, the vehicle environment information may include in-vehicle environment information and out-of-vehicle environment information. When the execution subject of the multi-modal interaction method is a server, the server may acquire the vehicle environment information and the user behavior information from the target vehicle; alternatively, the server acquires the in-vehicle environment information and the user behavior information from the target vehicle, and acquires the out-of-vehicle environment information, such as weather information, from the Internet, from other devices, or locally.
In addition, the execution subject of the multi-modal interaction method provided by the embodiment of the application can also be a vehicle-end system.
Based on the application scenario, the multi-modal interaction method provided by the present application is described in detail below with reference to specific embodiments.
Fig. 1 is a flowchart of a multimodal interaction method according to an embodiment of the present application. As shown in fig. 1, the multi-modal interaction method includes the following steps:
S101, vehicle environment information and user behavior information of the target vehicle are obtained.
Optionally, the vehicle environment information may include in-vehicle environment information and out-of-vehicle environment information. The in-vehicle environment information may include, but is not limited to, the in-vehicle temperature, vehicle speed, gear, multimedia style and state, in-vehicle ambient-light state, current navigation road conditions, user state, and the like. Specifically, the in-vehicle temperature, vehicle speed and gear may be the real-time in-vehicle temperature, vehicle speed and gear state of the target vehicle; the multimedia state and style may be the current multimedia playback state of the target vehicle and the corresponding style characteristics, for example, the playback state may be the music currently being played and the style may be the music style of that music; the in-vehicle ambient-light state may be off, manual, on, a specific ambient-light color, and so on; the current navigation road conditions may be severe congestion, slight congestion, clear roads, and the like; the user state may be the user's fatigue state, the user's mood, and the like, and may be obtained, for example, through a Driver Monitor System (DMS for short). The DMS is a human-machine interaction system based on intelligent driving. Specifically, the DMS collects the driver's eye conditions (eye tracking, gaze tracking, etc.) in real time through the vehicle-mounted optical camera and infrared camera, and analyzes the driver's condition with deep learning methods, thereby performing driver identity recognition, fatigue monitoring, attention monitoring and dangerous-driving detection, with hierarchical early warning.
The out-of-vehicle environment information may include, but is not limited to, weather information, time information, date information, out-of-vehicle temperature, geographic location, and the like. Specifically, the geographic location may be the current longitude and latitude of the target vehicle.
Optionally, the user behavior information refers to operation behaviors actively triggered by the user, for example, opening a window or turning on the air conditioner.
The vehicle environment information and the user behavior information of the target vehicle are acquired on the premise that the target vehicle is in the ignition-on state. Specifically, the information may be acquired when a user instruction is obtained, or may be acquired in real time. After the target vehicle enters the ignition-on state, the embodiments of the application place no limit on when the vehicle environment information and the user behavior information of the target vehicle are acquired.
In a possible implementation manner, when a user instruction is obtained, the vehicle environment information of the target vehicle may be information acquired in real time, and the user behavior information may be the user behavior information acquired within a time period before the moment the user instruction is obtained. Illustratively, the time period may be 3 minutes, 5 minutes, or the like. For example, if the user instruction is obtained at 10:10 a.m. and the time period is 5 minutes, the user behavior information acquired between 10:05 a.m. and 10:10 a.m. may be used.
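As a hedged illustration of the time-window acquisition described above, the sketch below filters timestamped behavior events down to those observed in a fixed window before the user instruction. The data structure and function names are illustrative assumptions, not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class BehaviorEvent:
    timestamp: datetime  # when the user action was observed
    action: str          # e.g. "open_window", "turn_on_air_conditioner"


def behaviors_before_instruction(events: List[BehaviorEvent],
                                 instruction_time: datetime,
                                 window_minutes: int = 5) -> List[BehaviorEvent]:
    """Keep only the behavior events observed within `window_minutes`
    immediately preceding the user instruction."""
    window_start = instruction_time - timedelta(minutes=window_minutes)
    return [e for e in events if window_start <= e.timestamp <= instruction_time]


# Example matching the text: instruction at 10:10 a.m., 5-minute window -> 10:05-10:10.
events = [
    BehaviorEvent(datetime(2022, 12, 22, 10, 7), "open_window"),
    BehaviorEvent(datetime(2022, 12, 22, 9, 50), "turn_on_air_conditioner"),
]
recent = behaviors_before_instruction(events, datetime(2022, 12, 22, 10, 10))
# recent contains only the 10:07 window-opening event
```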
And S102, determining emotion labels of the user according to the vehicle environment information and the user behavior information, wherein the emotion labels comprise positive labels and negative labels.
Optionally, the emotion label of the user may be a label reflecting the user's current mood.
In some embodiments, the emotion tag corresponding to each piece of vehicle environment information and the emotion tag corresponding to each piece of user behavior information may be determined respectively, the total number of the corresponding emotion tags may be counted, and the emotion tag of the user may be determined according to the final statistical result. For example, the specific statistical process may be automatically calculated by the vehicle-end system, and the final statistical result may be output.
And S103, generating multi-modal output according to the emotion tags based on a vehicle-mounted scene, wherein the vehicle-mounted scene comprises a parking scene and a driving scene, and the multi-modal output corresponding to different emotion tags is different.
It can be understood that, given the driving-safety considerations during vehicle travel, the vehicle-mounted scene can be divided into a parking scene and a driving scene. Specifically, in the parking scene the vehicle is in a stopped state, and in the driving scene the vehicle is in a driving state. It should be noted that in the parking scene the vehicle is still in the ignition-on state.
It should be noted that when the vehicle environment information and the user behavior information are acquired in response to obtaining a user instruction, the specific content of the multi-modal output is obtained according to the emotion label and the user instruction based on the vehicle-mounted scene; when they are not acquired in response to a user instruction, the specific content of the multi-modal output is obtained according to the emotion label alone based on the vehicle-mounted scene.
Optionally, the multi-modal output may include information in multiple modalities, for example an image modality, an audio modality, a video modality, a text modality, and the like.
In a possible implementation manner, based on the vehicle-mounted scene, the content of the multi-modal output can be created by combining at least one of an image generation technology, an audio generation technology and a video generation technology with a text generation technology, according to the emotion label and the user instruction or according to the emotion label alone; the resulting content is rich and diverse, close to the user's scene, and matches the user's intention.
It should be noted that the multi-modal output content is created in real time by understanding the user's intention from the emotion labels and the user instruction, so the created content is creative, unique and interesting. Specifically, the image, audio and video creation can be performed by AI; illustratively, the content corresponding to the audio modality is not an existing song.
In the embodiment of the application, vehicle environment information and user behavior information of a target vehicle are obtained, an emotion tag of a user is determined to be a positive tag or a negative tag according to the emotion tag corresponding to the vehicle environment information and the user behavior information, and multi-mode output is generated according to the determined emotion tag based on a vehicle-mounted scene where the target vehicle is located. In the application, the multi-modal output is generated according to the emotion labels obtained by multi-modal input such as vehicle environment information, user behavior information and the like, so that the multi-modal output is creative multi-modal content meeting the intention of a user, multi-modal output of vehicle-mounted voice interaction is realized, and user experience is improved.
On the basis of the above embodiments, in some embodiments, the trigger condition for acquiring the vehicle environment information and the user behavior information of the target vehicle may be acquiring a user instruction. Optionally, generating the multi-modal output according to the emotion label based on the vehicle-mounted scene may specifically include: generating the multi-modal output according to the emotion label and the user instruction based on the vehicle-mounted scene. The user instruction here is an instruction without an explicit user intention; for example, the user instruction may be "Sing me a song!", "Guess what mood I'm in", "I'm having such a hard time", and the like.
In the multi-modal interaction method provided by the embodiments of the application, after a user instruction is obtained, the vehicle-end system or the server acquires the current vehicle environment information and user behavior information, determines the user's current emotion label from them, and, based on the vehicle-mounted scene of the current vehicle, combines the user instruction with the current emotion label to understand the user's intention. AI creation is then performed through an image generation technology, a video generation technology, an audio generation technology and a text generation technology to obtain multi-modal output that better matches the user's intention, realizing multi-modal interaction for vehicle-mounted voice and further improving the user experience.
It can be understood that, according to the difference between the vehicle-mounted scene and the emotion tag of the user, the content of the multi-modal output corresponding to the vehicle-mounted scene is different, and the following describes in detail different implementations of generating the multi-modal output according to the emotion tag based on the vehicle-mounted scene through specific embodiments.
Optionally, based on the vehicle-mounted scene in the foregoing embodiment, generating the multi-modal output according to the emotion tag may specifically include the following implementation manners: if the vehicle-mounted scene is a driving scene, generating first content corresponding to the emotion label through a text generation technology and an image generation technology, and outputting the first content by adopting a text mode and an image mode; or if the vehicle-mounted scene is a driving scene, generating second content corresponding to the emotion label through a text generation technology and an audio generation technology, and outputting the second content by adopting a text mode and an audio mode. The generation manner of the first content and the second content is similar to that described above, and is not described here again.
In some embodiments, the output first content may be image content with a narration. Specifically, the image content is output in an image mode, and the corresponding narration is output in a text mode. For example, when the emotion label is a positive label, the output first content may be an artistic image with a narration; when the emotion label is a negative label, the output first content may be a healing-style artistic image with a narration.
In some embodiments, the output second content may be audio content with a preceding reminder phrase. Specifically, the audio content is output in an audio mode, and the corresponding preceding reminder phrase is output in a text mode. For example, when the emotion label is a positive label, the output second content may be cheerful music with a preceding reminder phrase; when the emotion label is a negative label, the output second content may be healing music with a preceding reminder phrase, and the preceding reminder phrase may be, for example, "here is some music to cheer you up".
Optionally, based on the vehicle-mounted scene in the above embodiment, the generating the multi-modal output according to the emotion tag may further include: if the vehicle-mounted scene is a parking scene, generating third content corresponding to the emotion label through a text generation technology and a video generation technology; and outputting the third content by adopting a text mode and a video mode. The generation manner of the third content is similar to that described above, and is not described here again.
In some embodiments, the output third content may be a video with a monologue. Specifically, the video content is output in a video mode, and the corresponding monologue is output in a text mode. For example, when the emotion label is a positive label, the output third content may be a short video with a monologue; when the emotion label is a negative label, the output third content may be a healing-style short video with a monologue.
It should be noted that the text-modality content generated by the text generation technology in the embodiments of the application is converted into speech for output.
Fig. 2 is a block diagram of multi-modal output provided by an embodiment of the present application. As shown in Fig. 2, different combinations of the vehicle-mounted scene and the emotion label correspond to different multi-modal contents, which may be output at the same time; the specific implementation and the output multi-modal contents are similar to those described above and are not repeated here.
In the embodiments of the application, based on the driving scene and the parking scene, and according to the user's positive or negative label, an image generation technology, a video generation technology or an audio generation technology is combined with a text generation technology to generate multi-modal content that reflects the user's intention and is creative, unique and interesting, thereby realizing multi-modal interaction for vehicle-mounted voice.
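The following sketch summarizes the scene/label dispatch described in this section as a simple selection function. It only returns a plan describing which modalities and what style of content to generate; the actual AI creation steps (image, audio, video and text generation) are represented by the returned description rather than implemented, and the label and scene string values are assumptions.

```python
def select_multimodal_output(scene: str, emotion_label: str) -> dict:
    """Map the vehicle-mounted scene and the user's emotion label to the output
    modalities and content style described in the embodiments."""
    if emotion_label == "neutral":
        # Neutral label: text only (converted to speech for output).
        return {"modalities": ["text"],
                "content": "voice reply responding to the user instruction"}

    style = "cheerful" if emotion_label == "positive" else "healing-style"
    if scene == "driving":
        # Driving scene: text + image (artistic image with narration), or
        # alternatively text + audio (music with a preceding reminder phrase).
        return {"modalities": ["text", "image"],
                "content": f"{style} artistic image with narration"}
    if scene == "parking":
        # Parking scene: text + video (short video with a monologue).
        return {"modalities": ["text", "video"],
                "content": f"{style} short video with a monologue"}
    raise ValueError(f"unknown vehicle-mounted scene: {scene}")
```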
The following describes in detail an implementation manner of determining the emotion label of the user according to the vehicle environment information and the user behavior information in step S102 with reference to fig. 3.
Fig. 3 is a flowchart of a multimodal interaction method according to another embodiment of the present application. As shown in fig. 3, determining the emotion label of the user according to the vehicle environment information and the user behavior information may specifically include the following steps:
S301, emotion labels corresponding to the vehicle environment information and the user behavior information are determined respectively.
In a possible implementation manner, the emotion labels corresponding to the vehicle environment information and the user behavior information may be determined according to mapping relations, stored in the vehicle-end system, between each piece of information and an emotion label. Illustratively, the mapping relation stored in the vehicle-end system for vehicle speed may be: a vehicle speed below 60 km/h corresponds to a negative label, and a vehicle speed of 60 km/h or above corresponds to a positive label; the vehicle speed in the vehicle environment information acquired in real time is then matched against this mapping to determine the corresponding emotion label.
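A minimal sketch of such a stored mapping is given below. Only the vehicle-speed rule comes from the example above; the navigation-traffic rule is an invented placeholder added for illustration.

```python
def label_for_vehicle_speed(speed_kmh: float) -> str:
    """Example rule from the text: below 60 km/h -> negative, otherwise positive."""
    return "negative" if speed_kmh < 60 else "positive"


def label_for_navigation_traffic(traffic: str) -> str:
    """Hypothetical rule (not from the patent): congestion maps to a negative label."""
    return "negative" if "congestion" in traffic else "positive"


def per_signal_labels(env_and_behavior_info: dict) -> list:
    """Apply the stored per-signal mapping rules to the acquired information."""
    labels = []
    if "vehicle_speed" in env_and_behavior_info:
        labels.append(label_for_vehicle_speed(env_and_behavior_info["vehicle_speed"]))
    if "navigation_traffic" in env_and_behavior_info:
        labels.append(label_for_navigation_traffic(env_and_behavior_info["navigation_traffic"]))
    return labels
```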
S302, a first total number of positive labels and/or a second total number of negative labels among the emotion labels is determined.
For example, the first total number of positive labels and/or the second total number of negative labels among the emotion labels may be obtained by counting.
S303, in response to the first total number being greater than a first threshold, determining that the emotion label of the user is a positive label.
S304, in response to the second total number being greater than a second threshold, determining that the emotion label of the user is a negative label.
It should be noted that, in the multi-modal interaction method provided in the embodiments of the application, the first threshold and the second threshold are set such that the first total number of positive labels being greater than the first threshold and the second total number of negative labels being greater than the second threshold cannot both occur at the same time.
In the embodiments of the application, the emotion labels corresponding to the vehicle environment information and the user behavior information are determined respectively, the first total number of positive labels and/or the second total number of negative labels among these emotion labels is then determined, and the user's emotion label is determined from the first total number and the second total number. In this way the user's current emotion can be understood, and multi-modal content that matches the user's intention can be output according to the emotion label.
Optionally, the emotion labels may further include a neutral label. Specifically, the multi-modal interaction method provided by the embodiments of the application may further include: in response to the first total number being less than or equal to the first threshold and the second total number being less than or equal to the second threshold, determining that the emotion label of the user is a neutral label; generating fourth content through a text generation technology; and outputting the fourth content by adopting a text mode. Specifically, the fourth content may be voice content for responding to the user instruction.
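The counting-and-threshold logic of steps S302-S304, extended with the neutral label described above, can be sketched as follows. The threshold values are illustrative assumptions; per the text, they should be chosen so that the positive and negative conditions cannot both hold at the same time.

```python
def determine_user_emotion_label(per_signal_labels: list,
                                 first_threshold: int = 3,
                                 second_threshold: int = 3) -> str:
    """Count positive and negative per-signal labels and compare the totals
    against the thresholds; fall back to a neutral label when neither
    threshold is exceeded."""
    first_total = per_signal_labels.count("positive")    # S302: first total number
    second_total = per_signal_labels.count("negative")   # S302: second total number
    if first_total > first_threshold:                    # S303: positive label
        return "positive"
    if second_total > second_threshold:                  # S304: negative label
        return "negative"
    return "neutral"                                     # neither threshold exceeded
```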
Optionally, determining the emotion label of the user according to the vehicle environment information and the user behavior information may also be implemented by inputting the vehicle environment information and the user behavior information into a language representation model to obtain the emotion label of the user output by the model. Specifically, the language representation model may store the mapping relation between the vehicle environment information, the user behavior information and the emotion labels, and may be deployed in the vehicle-end system or the server. Illustratively, the language representation model may be a BERT (Bidirectional Encoder Representations from Transformers) model.
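Where a language representation model is used instead of the rule-based mapping, one possible (non-authoritative) sketch uses the Hugging Face transformers text-classification pipeline. The checkpoint name, the way the inputs are serialized into text and the label set are all assumptions; the patent only states that a BERT-style language representation model outputs the emotion label.

```python
from transformers import pipeline

# Placeholder checkpoint: a BERT-style model fine-tuned to emit
# positive / negative / neutral labels would be needed in practice.
emotion_classifier = pipeline("text-classification", model="bert-base-chinese")


def emotion_label_from_model(env_info: dict, behavior_info: dict) -> str:
    """Serialize the acquired information into one text sequence and let the
    language representation model predict the emotion label."""
    text = "; ".join(f"{k}={v}" for k, v in {**env_info, **behavior_info}.items())
    prediction = emotion_classifier(text)[0]   # e.g. {"label": "...", "score": 0.97}
    return prediction["label"]
```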
It is understood that, in consideration of driving safety, the vehicle-mounted scene may be divided into a driving scene and a parking scene, and the corresponding multi-modal content is output to the user based on the vehicle-mounted scene. In some embodiments, before generating the multi-modal output according to the emotion label based on the vehicle-mounted scene, the vehicle-mounted scene needs to be determined, which may be implemented as follows: when the vehicle speed is 0 and/or the gear is neutral, determining that the vehicle-mounted scene is a parking scene; and when the vehicle speed is not 0 and the gear is not neutral, determining that the vehicle-mounted scene is a driving scene.
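The scene rule above translates directly into code; the gear encoding ("N" for neutral) is an assumption about how the gear signal is represented.

```python
def determine_vehicle_scene(vehicle_speed_kmh: float, gear: str) -> str:
    """Parking scene when the speed is 0 and/or the gear is neutral;
    driving scene when the speed is non-zero and the gear is not neutral."""
    if vehicle_speed_kmh == 0 or gear == "N":
        return "parking"
    return "driving"
```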
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Fig. 4 is a schematic structural diagram of a multi-modal interaction apparatus according to an embodiment of the present application. As shown in fig. 4, the multi-modal interaction apparatus 40 includes: an acquisition module 410, a first determination module 420, and an output module 430.
The acquisition module 410 is configured to acquire vehicle environment information and user behavior information of the target vehicle; the first determination module 420 is configured to determine emotion labels of the user according to the vehicle environment information and the user behavior information, where the emotion labels include positive labels and negative labels; and the output module 430 is configured to generate multi-modal output according to the emotion labels based on a vehicle-mounted scene, where the vehicle-mounted scene includes a parking scene and a driving scene, and the multi-modal output corresponding to different emotion labels is different.
In a possible implementation manner, the output module 430 is specifically configured to: if the vehicle-mounted scene is a driving scene, generating first content corresponding to the emotion label through a text generation technology and an image generation technology, and outputting the first content by adopting a text mode and an image mode; or if the vehicle-mounted scene is a driving scene, generating second content corresponding to the emotion tag through a text generation technology and an audio generation technology, and outputting the second content by adopting a text mode and an audio mode.
In one possible implementation, the output module 430 may further be configured to: if the vehicle-mounted scene is a parking scene, generating third content corresponding to the emotion label through a text generation technology and a video generation technology; and outputting the third content by adopting a text mode and a video mode.
In a possible implementation manner, the trigger condition for the acquisition module 410 to acquire the vehicle environment information and the user behavior information of the target vehicle includes: acquiring a user instruction; and the output module 430 is further configured to generate the multi-modal output according to the emotion labels and the user instruction based on the vehicle-mounted scene.
In a possible implementation manner, the first determination module 420 is specifically configured to: respectively determine emotion labels corresponding to the vehicle environment information and the user behavior information; determine a first total number of positive labels and/or a second total number of negative labels among the emotion labels; in response to the first total number being greater than a first threshold, determine that the emotion label of the user is a positive label; and in response to the second total number being greater than a second threshold, determine that the emotion label of the user is a negative label.
In one possible implementation, the emotion tags further include a neutral tag, and the first determining module 420 is further configured to: in response to the first total number being less than or equal to a first threshold and the second total number being less than or equal to a second threshold, determining that the emotional tag of the user is a neutral tag; generating fourth content through a text generation technology; and outputting the fourth content by adopting a text mode.
In one possible implementation manner, the first determining module 420 may be further configured to: and inputting the vehicle environment information and the user behavior information into the language representation model to obtain the emotion label of the user output by the language representation model.
In one possible implementation, the vehicle environment information includes a vehicle speed and a gear, and the multimodal interaction apparatus further includes a second determining module (not shown) configured to: when the vehicle speed is 0 and/or the gear is a neutral gear, determining that the vehicle-mounted scene is a parking scene; and when the vehicle speed is not 0 and the gear is not neutral, determining that the vehicle-mounted scene is a driving scene.
The apparatus provided in the embodiment of the present application may be configured to perform the method steps provided in the foregoing method embodiment, and the implementation principle and the technical effect are similar, which are not described herein again.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can all be implemented in the form of software invoked by a processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a function of the processing module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element here may be an integrated circuit with signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor capable of calling program code. For another example, these modules may be integrated together and implemented in the form of a System-On-a-Chip (SOC).
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a Digital Versatile Disc (DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), among others.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device 50 includes: at least one processor 510, memory 520, a communication interface 530, and a system bus 540. The memory 520 and the communication interface 530 are connected to the processor 510 through the system bus 540 and complete mutual communication, the memory 520 is used for storing instructions, the communication interface 530 is used for communicating with other devices, and the processor 510 is used for calling the instructions in the memory to execute the method steps provided by the above method embodiments.
The system bus 540 mentioned in fig. 5 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus 540 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 530 is used to enable communication between the database access device and other devices (e.g., clients, read-write libraries, and read-only libraries).
The Memory 520 may include a Random Access Memory (RAM) and may further include a non-volatile Memory (non-volatile Memory), such as at least one disk Memory.
The processor 510 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; or a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The embodiment of the present application further provides a computer-readable storage medium, where computer-executable instructions are stored in the computer-readable storage medium, and when the computer-executable instructions are executed by a processor, the method steps in the foregoing method embodiments are implemented, and the specific implementation manner and the technical effect are similar, and are not described herein again.
The embodiment of the application also provides a program product, which contains computer-executable instructions. When executed by a computer, the instructions implement the method steps in the above method embodiments; the specific implementation and technical effects are similar and are not repeated here.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
It should be noted that, the user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, displayed data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the relevant laws and regulations and standards of the relevant country and region, and are provided with corresponding operation entrances for the user to choose authorization or denial.

Claims (12)

1. A multi-modal interaction method, comprising:
acquiring vehicle environment information and user behavior information of a target vehicle;
determining emotion labels of a user according to the vehicle environment information and the user behavior information, wherein the emotion labels comprise positive labels and negative labels;
based on a vehicle-mounted scene, generating multi-modal output according to the emotion labels, wherein the vehicle-mounted scene comprises a parking scene and a driving scene, and the multi-modal output corresponding to different emotion labels is different.
2. The method of claim 1, wherein generating a multimodal output from the emotion tags based on the in-vehicle scene comprises:
if the vehicle-mounted scene is a driving scene, generating first content corresponding to the emotion label through a text generation technology and an image generation technology, and outputting the first content by adopting a text mode and an image mode;
or if the vehicle-mounted scene is a driving scene, generating second content corresponding to the emotion label through a text generation technology and an audio generation technology, and outputting the second content by adopting a text mode and an audio mode.
3. The method of claim 1, wherein generating a multimodal output from the emotion tags based on the in-vehicle scene comprises:
if the vehicle-mounted scene is a parking scene, generating third content corresponding to the emotion label through a text generation technology and a video generation technology;
and outputting the third content by adopting a text mode and a video mode.
4. The multimodal interaction method according to any one of claims 1 to 3, wherein the trigger condition for obtaining the vehicle environment information and the user behavior information of the target vehicle comprises: acquiring a user instruction;
the generating of multimodal output from the emotion labels based on the on-board scene includes:
and generating multi-modal output according to the emotion label and the user instruction based on the vehicle-mounted scene.
5. A multimodal interaction method as claimed in any of claims 1 to 3, wherein the determining of the emotion label of the user from the vehicle environment information and the user behaviour information comprises:
respectively determining emotion labels corresponding to the vehicle environment information and the user behavior information;
determining a first total number of active tags and/or a second total number of passive tags in the emotional tags;
in response to the first total number being greater than a first threshold, determining that the emotion label of the user is a positive label;
in response to the second total being greater than a second threshold, determining that the emotional tag of the user is a negative tag.
6. The multi-modal interaction method of claim 5, wherein the emotion tags further comprise neutral tags, the multi-modal interaction method further comprising:
in response to the first total number being less than or equal to a first threshold and the second total number being less than or equal to a second threshold, determining that the emotion tag of the user is a neutral tag;
generating fourth content through a text generation technology;
and outputting the fourth content by adopting a text mode.
7. A multimodal interaction method as claimed in any of claims 1 to 3, wherein the determining of the emotion label of the user from the vehicle environment information and the user behaviour information comprises:
and inputting the vehicle environment information and the user behavior information into a language representation model to obtain the emotion label of the user output by the language representation model.
8. The multimodal interaction method according to any one of claims 1 to 3, wherein the vehicle environment information comprises vehicle speed and gear, and before generating multimodal output according to the emotion label based on the vehicle-mounted scene, further comprising:
when the vehicle speed is 0 and/or the gear is a neutral gear, determining that the vehicle-mounted scene is a parking scene;
and when the vehicle speed is not 0 and the gear is not neutral, determining that the vehicle-mounted scene is a driving scene.
9. A multimodal interaction apparatus, comprising:
the acquisition module is used for acquiring vehicle environment information and user behavior information of the target vehicle;
a first determination module, configured to determine emotional tags of a user according to the vehicle environment information and the user behavior information, where the emotional tags include positive tags and negative tags;
the output module is used for generating multi-modal output according to the emotion labels based on a vehicle-mounted scene, wherein the vehicle-mounted scene comprises a parking scene and a driving scene, and the multi-modal output corresponding to different emotion labels is different.
10. An electronic device, comprising:
at least one processor;
and a memory communicatively coupled to the at least one processor;
wherein the memory is to store instructions executable by the at least one processor to enable the at least one processor to perform the multi-modal interaction method of any of claims 1 to 8.
11. A computer-readable storage medium having stored thereon computer-executable instructions for implementing the multi-modal interaction method of any one of claims 1 to 8 when executed by a processor.
12. A program product, characterized in that the program product contains computer executable instructions which, when executed, implement the multi-modal interaction method as claimed in any one of claims 1 to 8.
CN202211657075.8A (filed 2022-12-22) - Multi-modal interaction method, device, equipment and storage medium - Pending

Priority Applications (1)

Application Number: CN202211657075.8A
Priority Date / Filing Date: 2022-12-22
Title: Multi-modal interaction method, device, equipment and storage medium

Publications (1)

Publication Number: CN115859219A
Publication Date: 2023-03-28

Family

ID=85653897

Family Applications (1)

Application Number: CN202211657075.8A (Pending)
Priority Date / Filing Date: 2022-12-22
Title: Multi-modal interaction method, device, equipment and storage medium

Country Status (1)

Country: CN
Link: CN115859219A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination