CN116030811B - Voice interaction method, vehicle and computer readable storage medium - Google Patents

Voice interaction method, vehicle and computer readable storage medium

Info

Publication number: CN116030811B
Authority: CN (China)
Prior art keywords: user, voice, emotion, vehicle, information
Legal status: Active
Application number: CN202310294793.1A
Other languages: Chinese (zh)
Other versions: CN116030811A
Inventors: 巴特尔, 孙仿逊, 曹川, 李明洋, 李云飞
Current Assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee: Guangzhou Xiaopeng Motors Technology Co Ltd
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202310294793.1A
Publication of CN116030811A
Application granted
Publication of CN116030811B

Landscapes

  • Navigation (AREA)

Abstract

The application discloses a voice interaction method, which comprises the following steps: receiving a user voice request in a vehicle cabin; performing natural language understanding on the voice request to obtain intention information; predicting the current emotion of the user according to the intention information and the voice request; if the predicted value of the user's current emotion is greater than a first threshold, constructing a prediction dataset from the intention information, each piece of standard slot information, and the voice request; performing slot prediction on the user voice request according to the prediction dataset; and completing the voice interaction according to the slot prediction result. With this method, the user's current emotion can be predicted from the result of natural language understanding of the user's voice request, and the user can be replied to according to the result of slot prediction on the voice request, thereby completing the voice interaction. The voice interaction method can monitor the user's emotion, generate natural language to reply to the user, soothe the user's emotion in time, and reduce the influence of the user's negative emotion on driving safety.

Description

Voice interaction method, vehicle and computer readable storage medium
Technical Field
The present disclosure relates to the field of vehicle-mounted voice technologies, and in particular, to a voice interaction method, a vehicle, and a computer readable storage medium.
Background
Currently, vehicle-mounted voice technology can support users in interacting by voice within the vehicle cabin, for example to control vehicle components or to interact with components in the in-vehicle system user interface. However, current vehicle-mounted voice technology generally only processes the content of the user's voice request and ignores the emotion the user expresses, even though that emotion may be crucial to driving safety.
Disclosure of Invention
The application provides a voice interaction method, a vehicle and a computer readable storage medium.
The voice interaction method comprises the following steps:
receiving a user voice request in a vehicle cabin;
natural language understanding is carried out on the voice request, and intention information is obtained;
predicting the current emotion of the user according to the intention information and the voice request;
if the predicted value of the user's current emotion is greater than a first threshold, constructing a prediction dataset from the intention information, each piece of standard slot information, and the voice request;
performing slot prediction on the user voice request according to the prediction data set;
and finishing voice interaction according to the slot prediction result.
Therefore, the method and the device can perform natural language understanding on the user's voice request in the vehicle cabin and predict the user's current emotion from the obtained intention information. Slot prediction is performed on the voice request according to the prediction result for the user's current emotion, the user is replied to according to the slot prediction result, and the voice interaction is finally completed. With this voice interaction method, the emotion of a user issuing a voice request in the vehicle cabin can be monitored, a natural-language reply to the user can be generated from the slot prediction on the voice request, and the voice interaction completed. The user's emotion can thus be soothed in time, reducing the influence of negative emotion on driving safety.
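The claimed steps can be sketched as a minimal Python flow. Every function and value below (`nlu`, `predict_emotion`, `predict_slot`, the threshold) is an illustrative placeholder, not an API defined by the patent:

```python
# Illustrative sketch of the claimed interaction flow; all functions are
# hypothetical stubs standing in for real NLU / emotion / slot models.

FIRST_THRESHOLD = 0.5  # example value; the description suggests e.g. 0.5 or 0.6

def nlu(request):
    """Stub natural language understanding: returns intention information."""
    if "sweating" in request:
        return {"intent": "complain_temperature"}
    return {"intent": "unknown"}

def predict_emotion(intent, request):
    """Stub emotion predictor: returns a score in [0, 1]."""
    return 0.6 if intent["intent"] == "complain_temperature" else 0.1

def predict_slot(request, intent):
    """Stub slot prediction over a candidate set of standard slot information."""
    return "reduce air conditioner temperature"

def handle_voice_request(request):
    intent = nlu(request)                      # natural language understanding
    score = predict_emotion(intent, request)   # predict current emotion
    if score > FIRST_THRESHOLD:                # negative emotion detected
        return predict_slot(request, intent)   # slot prediction -> basis for reply
    return None                                # emotion stable: no soothing needed
```

A request hinting at discomfort is routed to slot prediction, while a neutral request is not, mirroring the first-threshold branch in the claim.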
The predicting the current emotion of the user according to the intention information and the voice request comprises the following steps:
acquiring a context voice feature of the user voice request and a vehicle feature corresponding to the context voice feature;
and predicting the current emotion of the user according to the intention information, the user voice request, the contextual voice features and the corresponding vehicle features.
In this way, the contextual speech features and corresponding vehicle features of the user's speech request may be derived from natural language understanding to predict the user's current emotion.
The obtaining the contextual phonetic feature of the user phonetic request and the vehicle feature corresponding to the contextual phonetic feature includes:
acquiring a historical voice request, a natural language understanding result corresponding to the historical voice request and a historical prediction result of the current emotion of a user;
and constructing a context sequence element according to the historical voice request, the corresponding natural language understanding result and the historical prediction result to obtain the context voice feature.
In this way, the context sequence element can be constructed by acquiring the historical voice request and the historical prediction result of the current emotion of the user, so that the context voice characteristic can be obtained, and the current emotion of the user can be predicted.
The constructing a context sequence element according to the historical voice request, the corresponding natural language understanding result and the historical prediction result to obtain the contextual voice feature includes:
determining an interaction order, a user position and a voice text of the historical voice request;
determining domain information, intention information and slot position information of the historical voice request according to the natural language understanding result;
determining emotion scores and emotion vectors of the historical voice requests according to the historical prediction results;
and constructing the context sequence element according to the interaction sequence, the user position, the voice text, the domain information, the intention information, the slot position information, the emotion score and the emotion vector corresponding to the historical voice request, and obtaining the context voice characteristic.
In this way, according to the natural language understanding result corresponding to the historical voice request, the context sequence element is constructed, and the context voice feature is obtained, so that the current emotion of the user is predicted by combining the vehicle feature and the user voice request.
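One context sequence element can be pictured as a record combining the eight items listed above. The key names `sequence_id`, `location` and `query` appear later in the description; the remaining key names are illustrative assumptions:

```python
# One context sequence element per historical voice request. Key names other
# than sequence_id / location / query are illustrative, not fixed by the patent.
def build_context_element(order, location, text, nlu_result, emotion_result):
    return {
        "sequence_id": str(order),                   # interaction order
        "location": str(location),                   # user position (voice zone)
        "query": text,                               # voice text
        "domain": nlu_result["domain"],              # domain information
        "intent": nlu_result["intent"],              # intention information
        "slots": nlu_result["slots"],                # slot information
        "emotion_score": emotion_result["score"],    # historical emotion score
        "emotion_vector": emotion_result["vector"],  # historical emotion vector
    }

element = build_context_element(
    0, 0, "turn on the air conditioner",
    {"domain": "air conditioning", "intent": "open",
     "slots": {"device": "air conditioner"}},
    {"score": 0.2, "vector": [0.1, 0.7, 0.2]},
)
```

A sequence of such elements, ordered by `sequence_id`, forms the contextual voice feature used for emotion prediction.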
The obtaining the contextual phonetic feature of the user phonetic request and the vehicle feature corresponding to the contextual phonetic feature includes:
and acquiring vehicle cabin state information, driving information, communication information, entertainment information and driving safety information corresponding to the historical voice request, and constructing vehicle sequence elements to obtain the vehicle features.
Therefore, vehicle sequence elements during voice interaction of a user can be constructed according to the multidimensional vehicle state information corresponding to the historical voice request, so that vehicle characteristics are obtained.
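A vehicle sequence element groups the five state dimensions named above; the concrete sub-fields shown here are invented for illustration only:

```python
# Hypothetical vehicle sequence element; the five top-level dimensions come
# from the text, the sub-fields are illustrative assumptions.
def build_vehicle_element(cabin, driving, communication, entertainment, safety):
    return {
        "cabin_state": cabin,            # vehicle cabin state information
        "driving_info": driving,         # driving information
        "communication": communication,  # communication information
        "entertainment": entertainment,  # entertainment information
        "driving_safety": safety,        # driving safety information
    }

vehicle_feature = build_vehicle_element(
    {"cabin_temperature_c": 29},
    {"speed_kmh": 60},
    {"active_call": False},
    {"media_playing": True},
    {"lane_keep_warning": False},
)
```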
The step of completing voice interaction according to the slot prediction result comprises the following steps:
natural language generation is carried out according to the slot position prediction result;
feeding back an emotion pacifying strategy in the vehicle cabin according to a natural language generation result;
and receiving a confirmation request of a user in a vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
Thus, natural language generation can be performed according to the slot prediction result, and an emotion soothing strategy is fed back to the user. After the user confirms, the vehicle is controlled to execute the corresponding emotion soothing strategy, so that the user's negative emotion is soothed, driving safety is ensured, and the user's interaction experience is improved.
The natural language generation according to the slot prediction result comprises the following steps:
and generating natural language according to the slot information corresponding to the maximum predicted value that is greater than the second threshold in the slot prediction result.
Therefore, the slot information with the predictive value larger than a certain threshold value can be selected according to the slot predictive result and used for natural language generation so as to feed back emotion soothing strategies to the user and sooth negative emotion of the user.
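The selection rule above amounts to an argmax with a threshold. A minimal sketch, assuming the slot prediction result is a mapping from slot information to predicted values (the threshold value is an example, not specified by the patent):

```python
def select_slot(slot_scores, second_threshold=0.5):
    """Return the slot information with the maximum predicted value, provided
    that value exceeds the second threshold; otherwise return None."""
    slot, score = max(slot_scores.items(), key=lambda kv: kv[1])
    return slot if score > second_threshold else None
```

If no candidate clears the threshold, nothing is handed to natural language generation.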
After the step of predicting the current emotion of the user according to the intention information and the voice request, the voice interaction method comprises the following steps:
if the predicted value of the current emotion of the user is larger than a third threshold value, natural language generation is performed according to the predicted result of the current emotion of the user and the intention information;
feeding back an emotion pacifying strategy in the vehicle cabin according to the natural language generated result;
and receiving a confirmation request of a user in a vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
Therefore, natural language generation can be performed according to the predicted value of the current emotion of the user, an emotion pacifying strategy is fed back to the user, and the vehicle is controlled to execute the corresponding emotion pacifying strategy after the user confirms the emotion pacifying strategy, so that the negative emotion of the user is pacified, driving safety is guaranteed, and user interaction experience is improved.
The vehicle of the present application comprises a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, implements the method described above.
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the method described above.
Additional aspects and advantages of embodiments of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of embodiments of the application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow chart of a voice interaction method of the present application;
FIG. 2 is a second flow chart of the voice interaction method of the present application;
FIG. 3 is a third flow chart of the voice interaction method of the present application;
FIG. 4 is a schematic diagram of user mood swings for the voice interaction method of the present application;
FIG. 5 is a flow chart of a voice interaction method of the present application;
FIG. 6 is a fifth flow chart of the voice interaction method of the present application;
FIG. 7 is a flow chart of a voice interaction method of the present application;
FIG. 8 is a schematic diagram of vehicle cabin information for the voice interaction method of the present application;
FIG. 9 is a flow chart of a voice interaction method of the present application;
FIG. 10 is a flowchart illustrating a voice interaction method according to the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the embodiments of the present application and are not to be construed as limiting the embodiments of the present application.
With the development and popularization of vehicle electronic technology, a vehicle can perform voice interaction with a user: it can recognize the user's voice request and finally fulfill the intention in it. This human-vehicle voice interaction meets the diversified needs of drivers and passengers during driving. However, in the related art, the vehicle-mounted system only reacts to instructional voice requests issued by the user and executes the functional instructions in them, such as "turn on the air conditioner". Because the environment during actual driving is diverse, users may develop negative emotions while driving, such as cursing in frustration, which affects driving safety.
In addition, since users make voice requests with some randomness, the corresponding intention may not always be detectable. A voice request such as "I'm sweating" clearly reflects the user's negative feeling about the in-vehicle temperature, but the entity "air conditioner" cannot be extracted from it; the user would have to say "turn on the air conditioner" explicitly for the vehicle-mounted system to recognize it, resulting in a poor interaction experience.
Therefore, in the related art, negative emotion expressed by a user through a voice request cannot be monitored and defused in time, which affects driving safety and degrades the user's interaction experience.
Based on the above problems that may be encountered, referring to fig. 1, the present application provides a voice interaction method, including:
01: receiving a user voice request in a vehicle cabin;
02: natural language understanding is carried out on the voice request, and intention information is obtained;
03: predicting the current emotion of the user according to the intention information and the voice request;
04: if the predictive value of the current emotion of the user is larger than a first threshold value, constructing a predictive data set according to the intention information by utilizing the standard slot information and the voice request;
05: carrying out slot prediction on the user voice request according to the prediction data set;
06: and finishing voice interaction according to the slot prediction result.
The application also provides a vehicle including a memory and a processor. The voice interaction method can be realized by the vehicle. Specifically, the memory stores a computer program, the processor is configured to receive a user voice request in a vehicle cabin, perform natural language understanding on the voice request, obtain intention information, predict a current emotion of the user according to the intention information and the voice request, and if a predicted value of the current emotion of the user is greater than a first threshold, construct a predicted dataset according to the intention information by using each standard slot information and the voice request, perform slot prediction on the user voice request according to the predicted dataset, and complete voice interaction according to a slot prediction result.
Specifically, according to the voice interaction method, after receiving the user voice request in the vehicle cabin, natural language understanding can be performed on the user voice request, and intention information in sentences is obtained.
The user voice request and its acquired intention information are processed and analyzed to obtain the intention information and features of the voice request, so as to predict the user's current emotion. The user's emotion can be characterized along multiple dimensions. The first dimension is a predicted safety coefficient, classified as safe or dangerous; this dimension can be judged from the voice zone of the user making the voice request and the current running state of the vehicle. The second dimension characterizing the user's emotion is an emotion category, which may include positive emotion or anger, etc.
In a vehicle cabin, there are generally two roles, driver and passenger. As shown in table 1, the priority of the current user voice request processing may be determined by the voice zone judgment and the running state judgment of the vehicle.
TABLE 1
(Table 1 is provided as an image in the original publication; it maps the voice zone and the vehicle's running state to the processing priority of the voice request.)
After emotion prediction for the driver in the main driving voice zone is completed and the driver's emotion is determined to be stable, emotion prediction is performed one by one for the passengers in the other voice zones.
Further, the prediction result for the user's current emotion can be quantified as a prediction score. A first threshold may be set, and the user's emotion classified according to the relation between the prediction score and the first threshold. For example, when the predicted value of the user's current emotion is greater than the first threshold, it may be determined that the user currently has a negative emotion; otherwise the user's current emotion is stable and no negative emotion is present. The first threshold may be set to a value such as 0.5 or 0.6; since it only judges whether the user has a negative emotion, the value should be set neither too high nor too low. The specific value of the first threshold is not limited herein.
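The classification rule reduces to a single comparison; a minimal sketch, assuming the example threshold of 0.5 from the text:

```python
def classify_emotion(pred_score, first_threshold=0.5):
    """Classify the user's current emotion from its prediction score;
    0.5 is one of the example threshold values given in the description."""
    return "negative" if pred_score > first_threshold else "stable"
```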
When the predicted value of the user's current emotion is greater than the first threshold, for example when the first threshold is 0.5 and the predicted value is 0.6, the reason for the negative emotion is further determined in order to determine a solution to it. However, when the user is in a negative emotion, a standard instructional voice request may not be issued, and slot information may not be extractable from the voice request text. Thus, the prediction dataset can be constructed using the standard slot information according to the intention information of the user's voice request, combined with the information in the request. Standard slot information is explicit text directly pointing to the entity of a vehicle component; for example, the standard slot information corresponding to the voice request "turn the air conditioner down a bit" may be "reduce air conditioner temperature".
The prediction dataset includes an encoding of each word of the user's voice request, and an encoding of each piece of slot information the request may point to. In one example, if the user makes the voice request "I am sweaty", then before slot prediction, the possible standard slot information is created, including the slot names and slot values shown in table 2:
TABLE 2
(Table 2 is provided as an image in the original publication; per the surrounding text, it lists candidate slot names (air conditioner, screen, temperature, volume, default), each with two slot values, e.g. on/off for the air conditioner.)
Each word of the user's voice request "I am sweaty" is encoded to get [43, 56, 123, 546, 865, 76]. The possible slot information is encoded to construct a dataset, such as air conditioner [1000001, 1000002], screen [2000001, 2000002], temperature [3000001, 3000002], volume [4000001, 4000002], default [5000001, 5000002]. Each slot value under each slot name corresponds to one code; for example, under the slot name "air conditioner", code 1000001 corresponds to the slot value "on" and code 1000002 to "off".
After encoding the user voice request and slot information, a structured dataset may be obtained as follows:
[[43, 56, 123, 546, 865, 76, 1000001],
 [43, 56, 123, 546, 865, 76, 1000002],
 [43, 56, 123, 546, 865, 76, 2000001],
 [43, 56, 123, 546, 865, 76, 2000002],
 [43, 56, 123, 546, 865, 76, 3000001],
 [43, 56, 123, 546, 865, 76, 3000002],
 [43, 56, 123, 546, 865, 76, 4000001],
 [43, 56, 123, 546, 865, 76, 4000002],
 [43, 56, 123, 546, 865, 76, 5000001],
 [43, 56, 123, 546, 865, 76, 5000002]]
Slot prediction can then be performed according to the prediction dataset, that is, one piece of slot information is selected from the dataset as the slot prediction result. The slot prediction result can reveal the reason why the user currently has a negative emotion, serve as the basis for soothing it, and finally complete the voice interaction.
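The structured dataset above can be reconstructed programmatically from the example encodings: the encoded query is concatenated with each candidate slot-value code in turn.

```python
# Reconstruct the example prediction dataset from the text: one row per
# candidate slot-value code appended to the encoded query.
query_encoding = [43, 56, 123, 546, 865, 76]  # encoding of "I am sweaty"
slot_codes = {
    "air conditioner": [1000001, 1000002],
    "screen": [2000001, 2000002],
    "temperature": [3000001, 3000002],
    "volume": [4000001, 4000002],
    "default": [5000001, 5000002],
}

prediction_dataset = [
    query_encoding + [code]
    for codes in slot_codes.values()
    for code in codes
]
```

Each of the ten rows pairs the same query with one slot-value candidate, matching the listing in the description.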
In summary, in the present application, natural language understanding is performed on the user's voice request in the vehicle cabin, and the user's current emotion is predicted from the obtained intention information. Slot prediction is performed on the voice request according to the emotion prediction result, the user is replied to according to the slot prediction result, and the voice interaction is finally completed. With this voice interaction method, the emotion of a user issuing a voice request in the vehicle cabin can be monitored, a natural-language reply can be generated from the slot prediction, and the voice interaction completed, so that the user's emotion is soothed in time and the influence of negative emotion on driving safety is reduced.
Referring to fig. 2, step 03 includes:
031: acquiring a context voice feature of a user voice request and a vehicle feature corresponding to the context voice feature;
032: and predicting the current emotion of the user according to the intention information, the user voice request, the contextual voice features and the corresponding vehicle features.
The processor is used for acquiring the contextual phonetic feature of the user phonetic request and the vehicle feature corresponding to the contextual phonetic feature, and predicting the current emotion of the user according to the intention information, the user phonetic request, the contextual phonetic feature and the corresponding vehicle feature.
In particular, as shown in FIG. 3, the contextual voice features associated with the current user voice request, and the corresponding vehicle features, may be obtained. The contextual voice features include the various types of information contained in the user's voice requests and may take matrix form; the specific form is not limited herein. The information in the contextual voice features is determined from the natural language understanding results of the historical voice requests issued before the current one, together with the historical prediction results of the user's emotion, so as to predict the user's current emotion.
The vehicle characteristics include various status characteristics of the vehicle itself. The vehicle characteristics corresponding to the current voice request can be obtained according to the contextual voice characteristics when the user sends the current voice request in the history process.
After obtaining the contextual speech features and the vehicle features of the user's speech request, the current emotion of the user making the speech request may be predicted in combination with the user's speech request and the intent information therein. The prediction process also needs to refer to each item of information corresponding to the historical voice request.
When predicting the current emotion of the user making the voice request, reference can be made to the fluctuation of the user's emotion prediction results over multiple historical interactions. The line graph in FIG. 4 shows the fluctuation of the user's emotion over 7 interactions. In the third interaction the emotion prediction result is greater than 0.8, while in the other interactions it is less than 0.5. It can be seen that in the third interaction the user's emotion fluctuated greatly, possibly indicating a negative emotion. Therefore, when the user's current emotion fluctuates greatly, it can be judged that the user may have a negative emotion and needs soothing to ensure driving safety.
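The fluctuation check described for FIG. 4 can be sketched as follows; the scores are made-up values matching the figure's description (third interaction above 0.8, the rest below 0.5), and the 0.8 cutoff is taken from the text:

```python
# Flag interactions whose emotion prediction result spikes above a cutoff,
# mirroring the FIG. 4 example (scores are illustrative, not real data).
scores = [0.3, 0.4, 0.85, 0.35, 0.4, 0.3, 0.45]  # 7 interactions

def flag_negative_interactions(scores, cutoff=0.8):
    """Return 1-based interaction indices whose score exceeds the cutoff."""
    return [i + 1 for i, s in enumerate(scores) if s > cutoff]
```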
In this way, the contextual speech features and corresponding vehicle features of the user's speech request may be derived from natural language understanding to predict the user's current emotion.
Referring to fig. 5, step 031 includes:
0311: acquiring a natural language understanding result corresponding to the historical voice request and a historical prediction result of the current emotion of the user;
0312: and constructing a context sequence element according to the historical voice request, the corresponding natural language understanding result and the historical prediction result to obtain the context voice characteristic.
The processor is used for acquiring the historical voice request, a natural language understanding result corresponding to the historical voice request and a historical prediction result of the current emotion of the user, and constructing a context sequence element according to the historical voice request, the corresponding natural language understanding result and the historical prediction result to obtain the context voice characteristic.
Specifically, in each interaction process of the user, natural language understanding needs to be performed on the voice request, as shown in fig. 3, to obtain a natural language understanding result of the voice request of the user, including intention information and slot position information of the voice request. And predicting the emotion of the current user according to the obtained intention information. Through multiple voice interactions of the user, the historical voice request and the corresponding natural language understanding result can be obtained, and the historical prediction result of emotion when the user sends the current voice request can also be obtained.
Further, context sequence elements may be constructed. The context sequence elements may include speech features of the historical speech request, as well as intent features, slot features, etc. of the historical speech request as understood by natural language. The state of the current context sequence element can be predicted according to the context sequence element of the historical voice request, and finally the context voice characteristic of the current voice request is obtained so as to predict the current emotion of the user.
In this way, the context sequence element can be constructed by acquiring the historical voice request and the historical prediction result of the current emotion of the user, so that the context voice characteristic can be obtained, and the current emotion of the user can be predicted.
Referring to fig. 6, step 0312 includes:
03121: determining the interaction sequence, the user position and the voice text of the historical voice request;
03122: determining domain information, intention information and slot position information of a historical voice request according to a natural language understanding result;
03123: determining emotion scores and emotion vectors of the historical voice requests according to the historical prediction results;
03124: and constructing a context sequence element according to the interaction sequence, the user position, the voice text, the field information, the intention information, the slot position information, the emotion score and the emotion vector corresponding to the historical voice request, and obtaining the context voice characteristic.
The processor is used for determining the interaction sequence, the user position and the voice text of the historical voice request, determining the domain information, the intention information and the slot information of the historical voice request according to the natural language understanding result, determining the emotion score and the emotion vector of the historical voice request according to the historical prediction result, and constructing a context sequence element according to the interaction sequence, the user position, the voice text, the domain information, the intention information, the slot information, the emotion score and the emotion vector corresponding to the historical voice request to obtain the context voice feature.
Specifically, as shown in fig. 7, the process of obtaining contextual voice features is described. Each time the user interacts with the vehicle-mounted system, the system constructs context sequence elements for historical voice requests in time order while performing natural language understanding, and determines the interaction order of each historical voice request. For example, in most cases the voice request issued by a user immediately after getting into the vehicle relates to the in-vehicle environment, such as "turn on the air conditioner" or "open the window for ventilation", so the interaction order of that voice request may be represented as { "sequence_id": "0" }. The interaction order of the user's second voice request, "play Li Zongcheng's Hills", may then be represented as { "sequence_id": "1" }.
The user's position can be judged from the voice zone, so as to determine whether the user making the voice request is the driver or a passenger. In an intelligent-vehicle scenario, the four voice zones in the vehicle can be vectorized to judge the position in the cabin of the user issuing a voice request. Requests also correlate with the domain of the user's need: navigation-related requests come more often from the main driving voice zone, while requests in domains such as multimedia and air conditioning come more often from users in the other voice zones. When multiple voice requests are pending, the emotion of the main driving user can be handled first by judging the voice zone of each request, which guarantees driving safety to a certain extent.
When distinguishing the sound zone of a voice request, different identifiers, such as numbers or letters, can be used to represent the relative positions of users in the vehicle cabin. In a small vehicle, for example, 0 may correspond to the driver's seat, 1 to the front passenger seat, 2 to the rear seat behind the driver, and 3 to the rear seat behind the front passenger. When the driver makes a voice request, the user position may be recorded as { "location": "0" }. The specific identifiers used to distinguish users are not limited herein.
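The zone-to-position encoding above can be sketched as follows. This is a minimal illustration: the zone names and the dictionary layout are assumptions, and only the 0-3 numbering follows the text.

```python
# Hypothetical mapping from in-cabin sound zones to position identifiers.
# Zone names are illustrative; the 0-3 numbering follows the description.
ZONE_TO_LOCATION = {
    "main_driver": "0",      # driver's seat
    "front_passenger": "1",  # front passenger seat
    "rear_left": "2",        # rear seat behind the driver
    "rear_right": "3",       # rear seat behind the front passenger
}

def location_element(zone: str) -> dict:
    """Build the user-position element of a context sequence."""
    return {"location": ZONE_TO_LOCATION[zone]}

print(location_element("main_driver"))  # {'location': '0'}
```

A request routed through such a mapping can then be prioritized when it comes from the main-driver zone.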
The vehicle-mounted system can also obtain the voice text corresponding to the user's historical voice request. For example, when the user says "play Hills by Li Zongcheng", the request may be recorded as { "query": "play Hills by Li Zongcheng" }, so that the emotional characteristics of the user's voice request can be read directly from the analysis of the voice text by the natural language understanding model.
Further, in the process of natural language processing of the user's historical and current voice requests, multidimensional information in the historical voice request can be acquired based on embedding techniques, including the domain information, intention information and slot information of the historical voice request.
The voice interaction of the vehicle-mounted system can support multiple domains, including navigation, music, vehicle control, air conditioning, calls, charging piles, weather, and other question-and-answer content. When a user issues a voice request, the vehicle-mounted system can predict, through natural language understanding, which supported topic domain the current voice request belongs to, vectorize it, and store it in the data set of voice-request domain information, enriching the feature information of the user's voice requests. During each interaction between the user and the vehicle-mounted system, the system can infer the semantic domain of the current voice request from the domain information of the historical voice requests.
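The domain vectorization step can be sketched as a simple one-hot encoding. The domain list is taken from the text; the identifier spellings and the one-hot scheme itself are illustrative assumptions, since the patent does not specify the vectorization.

```python
# Supported topic domains (names from the text, spellings assumed).
DOMAINS = ["navigation", "music", "vehicle_control", "air_conditioner",
           "call", "charging_pile", "weather", "qa"]

def domain_vector(domain: str) -> list:
    """One-hot vectorization of the predicted domain (illustrative)."""
    return [1.0 if d == domain else 0.0 for d in DOMAINS]

print(domain_vector("music"))  # 1.0 at index 1, 0.0 elsewhere
```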
The voice interaction function of the vehicle-mounted system can recognize the intention of the user's voice request. When a user issues a voice request, the system can predict the intention of the current request through natural language understanding, vectorize it, and store it in the data set of voice-request intention information, so that intention information can subsequently be extracted from user voice requests more accurately. Some vehicle-mounted systems can recognize more than 700 different intentions.
In addition, user voice requests with different intentions have corresponding slot information, which includes a slot name and a slot value. For example, for the user voice request "turn on the air conditioner", the slot name is "air conditioner" and the slot value is "turn on". The slot information is normalized to obtain a dimensionless normalized value. The slot name, slot value and normalized value can be vectorized in parallel and stored in the data set of voice-request slot information, making subsequent slot extraction on user voice requests more accurate.
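The dimensionless normalization of a slot value might look like the following. Min-max scaling is an assumption here; the patent only states that a normalized value without dimension is produced, not the formula.

```python
# Illustrative min-max normalization of a numeric slot value; the formula
# and the 16-32 degree temperature range are assumptions.
def normalize(value: float, lo: float, hi: float) -> float:
    """Map a raw slot value into the dimensionless range [0, 1]."""
    return (value - lo) / (hi - lo)

# Example: a temperature slot value of 26 on an assumed 16-32 degree scale.
slot = {"name": "temperature", "value": 26,
        "normalized": normalize(26, 16, 32)}
print(slot["normalized"])  # 0.625
```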
Finally, the emotion score and emotion vector of each historical voice request may be determined from the historical prediction results. The emotion score of each historical voice request participates in the calculation of the emotion vector; that is, the emotion vector carries the emotion score information of each historical voice request. The emotion vectors of the historical voice requests make it convenient to judge whether the user issuing a voice request is in a negative emotional state.
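The patent states only that the emotion vector "carries" the emotion score of each historical request. One way such a fixed-length vector could be maintained is a decayed shift, sketched below; this update rule is purely an illustrative assumption, not the patented method.

```python
# Assumed update: shift the newest emotion score in at the front while
# decaying the older entries, keeping the vector length fixed.
def update_emotion_vector(vec: list, score: float, decay: float = 0.5) -> list:
    """Fold a new per-request emotion score into a fixed-length vector."""
    return [score] + [decay * v for v in vec[:-1]]

vec = update_emotion_vector([0.2, 0.4, 0.6, 0.8], 0.9)
print(vec)  # newest score first, older scores attenuated
```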
In one example, the second user voice request often received after the voice assistant starts working is "play Hills by Li Zongcheng". A context sequence element is then constructed from the interaction order, user position, voice text, domain information, intention information and slot information of this historical voice request: { "sequence_id": "1", "location": "0", "query": "play Hills by Li Zongcheng", "domain": "music", "intent": "music_search_space", "slot": [ { "name": "music_singer@music", "value": "Li Zongcheng" }, { "name": "music_song@music", "value": "Hills" } ] }.
In the above example, after the user issues the voice request "play Hills by Li Zongcheng" to interact with the vehicle-mounted system, the corresponding emotion score can be predicted by the model and the corresponding emotion vector obtained. The emotion score and emotion vector may be expressed as: { "emotion_score": "0.042", "emo_embedding": [0.21, 0.41, 0.51, 0.11] }.
Finally, the emotion score and emotion vector, together with the other context sequence elements, including the interaction order, user position, voice text, domain information, intention information and slot information corresponding to the historical voice request, form the contextual voice feature. As shown in fig. 7, the resulting contextual voice features can be used together with the vehicle features when predicting the current emotion of the user.
In this way, according to the natural language understanding result corresponding to the historical voice request, the context sequence element is constructed, and the context voice feature is obtained, so that the current emotion of the user is predicted by combining the vehicle feature and the user voice request.
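The assembly of a context sequence element described in this section can be sketched as follows. The field names follow the worked examples in the text; the intent and slot identifiers are copied from those examples rather than from a real system.

```python
# Assemble one context sequence element from the pieces described above.
def build_context_element(sequence_id, location, query, domain, intent,
                          slots, emotion_score, emo_embedding):
    """Combine interaction order, position, text, NLU results and emotion
    information into a single context sequence element (dict)."""
    return {
        "sequence_id": sequence_id,
        "location": location,
        "query": query,
        "domain": domain,
        "intent": intent,
        "slot": slots,
        "emotion_score": emotion_score,
        "emo_embedding": emo_embedding,
    }

element = build_context_element(
    "1", "0", "play Hills by Li Zongcheng", "music", "music_search_space",
    [{"name": "music_singer@music", "value": "Li Zongcheng"},
     {"name": "music_song@music", "value": "Hills"}],
    "0.042", [0.21, 0.41, 0.51, 0.11])
print(element["domain"], len(element["slot"]))  # music 2
```

A list of such elements, ordered by `sequence_id`, would then constitute the contextual voice feature fed to the emotion prediction together with the vehicle features.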
Step 031 includes:
and acquiring vehicle cabin state information, driving information, communication information, entertainment information and driving safety information corresponding to the historical voice request to construct vehicle sequence elements to obtain vehicle characteristics.
The processor is used for acquiring vehicle cabin state information, driving information, communication information, entertainment information and driving safety information corresponding to the historical voice request so as to construct vehicle sequence elements to obtain vehicle characteristics.
Specifically, in an intelligent-vehicle scenario, vehicle characteristics can be obtained to determine vehicle parameters such as the vehicle type. Vehicles of different types differ in their sensitivity to sensing points such as vehicle speed.
The vehicle cabin information corresponding to the historical voice request can be obtained, including an analysis of the states of sensing points such as windows, doors and locks. Fig. 8 shows a tree diagram of part of the sensing-point information for the windows and locks in the vehicle cabin information, illustrating the correspondence between each vehicle part and its sensing points.
Similarly, the driving information acquired for the historical voice request may include vehicle speed, vehicle position, gear, and the like.

Communication information includes Bluetooth, call and volume settings, etc.

Entertainment information mainly includes the states of the various multimedia in the vehicle.

Driving safety information includes weather conditions, probability of rainfall, humidity, air quality, etc.
The multidimensional vehicle state information corresponding to the historical voice request can be obtained from the memory. When a user interacts with the vehicle-mounted system, the system can monitor the vehicle through its sensing points to obtain the vehicle cabin state information, driving information, communication information, entertainment information and driving safety information at the current moment, and construct the vehicle sequence element in combination with the information corresponding to the historical voice request. The vehicle sequence element may be, for example, an arrangement of the specific parameter values of these five kinds of information at the time the user makes a voice request. From the vehicle sequence element, the vehicle characteristics at the time the user issued the voice request can be obtained.
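The arrangement of the five information groups into one vehicle sequence element might look like the sketch below. The group contents and key names are illustrative assumptions; a real sensing-point set would be far richer.

```python
# Flatten the five information groups into one mapping of sensing-point
# values (an assumed arrangement of the vehicle sequence element).
def build_vehicle_element(cabin, driving, communication, entertainment, safety):
    """Merge the five state-information groups into one vehicle element."""
    element = {}
    for group in (cabin, driving, communication, entertainment, safety):
        element.update(group)
    return element

vehicle_element = build_vehicle_element(
    cabin={"window": "closed", "door_lock": "locked"},
    driving={"speed_kmh": 62, "gear": "D"},
    communication={"bluetooth": "on", "volume": 12},
    entertainment={"media_player": "paused"},
    safety={"rain_probability": 0.7, "air_quality_index": 85},
)
print(len(vehicle_element))  # number of collected sensing-point values
```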
In some examples, the user makes a voice request such as "it is stuffy in the car", whose intention is "open the window". From the vehicle state information corresponding to the historical voice requests, it can be inferred from the current vehicle sequence element that the air-quality index in the vehicle may be below a certain threshold. If the weather sensing point of the vehicle determines that it is currently raining, the user can be warned, for example: "It is raining at the moment and opening the window may be dangerous. Do you still want to open it for ventilation?"
Therefore, vehicle sequence elements during voice interaction of a user can be constructed according to the multidimensional vehicle state information corresponding to the historical voice request, so that vehicle characteristics are obtained.
Referring to fig. 9, step 06 includes:
061: natural language generation is carried out according to the slot position prediction result;
062: feeding back emotion pacifying strategies in the vehicle cabin according to the natural language generation result;
063: and receiving a confirmation request of a user in the vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
The processor is used for generating natural language according to the groove position prediction result, feeding back emotion soothing strategies in the vehicle cabin according to the natural language generation result, receiving confirmation requests of users in the vehicle cabin for the emotion soothing strategies, and controlling the vehicle to execute the emotion soothing strategies so as to complete voice interaction.
Specifically, slot prediction can be performed on the user voice request to obtain a slot prediction result, and natural language generation can then be carried out according to that result, which serves as the slot information in the generated reply utterance.
When the user's negative emotion is judged to be high according to the user's emotion score and emotion vector, an emotion-soothing strategy can be fed back to the user based on the natural language generation result. In one example, a user issues a voice request for navigation to a supermarket, but the reply sentence generated by natural language does not answer the request; the user may then issue a voice request blaming the voice assistant. In that case an emotion-soothing strategy can be fed back, for example: "Dear owner, your current emotion seems abnormal. Among millions of roads, safety comes first. Would you like some soothing music?"
After the emotion-soothing strategy is fed back to the user in the vehicle cabin, the user may confirm it. If the user replies with an affirmative voice request such as "OK", confirming the strategy, the vehicle can be controlled to execute the corresponding emotion-soothing strategy. In the above example, after a confirmation request for "soothing music" is received from the user in the vehicle cabin, the in-vehicle music player can be controlled to play soothing music, calming the user's emotion and completing the voice interaction.
Thus, natural language generation can be performed according to the slot prediction result, and an emotion-soothing strategy fed back to the user. After the user confirms, the vehicle is controlled to execute the corresponding strategy, soothing the user's negative emotion, ensuring driving safety and improving the interaction experience.
Step 061 comprises:
and generating natural language according to the slot information corresponding to the maximum predictive value larger than the second threshold value in the slot predictive result.
And the processor is used for generating natural language according to the slot information corresponding to the maximum predictive value of the slot predictive result being greater than the second threshold value.
Specifically, the slot prediction result obtained by performing slot prediction on the user voice request includes the slot names possibly corresponding to the text of the voice request and the slot values corresponding to those names. From the slot prediction, the slot information for natural language generation can be obtained, so as to generate the reply utterance.
The slot information used for natural language generation is selected as follows. First, a predictive value is computed for each slot value in the slot information. Among the slot values corresponding to the one or more slot names, when the predictive value of one or more slot values is greater than the second threshold, the slot information corresponding to those predictive values is selected as the slot information for subsequent natural language generation. The second threshold screens out the slot information eligible for natural language generation; for example, it may be set to 0.5 or 0.6, and natural language generation is then performed on the slot information whose predictive value exceeds it. The specific value of the second threshold is not limited herein.
In one example, the user makes the voice request "I am sweating"; each word of the request may be encoded to obtain [43, 56, 123, 546, 865, 76].
Before generating the natural language text for feeding back the emotion-soothing strategy, the possible slot information shown in table 2 needs to be extracted and encoded, constructing a data set such as air conditioner [1000001, 1000002], screen [2000001, 2000002], temperature [3000001, 3000002], volume [4000001, 4000002], default [5000001, 5000002]. Each slot value under each slot name corresponds to one code; for the slot name "air conditioner", for example, code 1000001 means on and code 1000002 means off.
Finally, the predictive values [0.53, 0.2, 0.04, 0.08, 0.03, 0.02, 0.01, 0.02, 0.05, 0.02] for each group of data can be calculated from the slot values. When the second threshold is set to 0.5, only the predictive value for the slot name "air conditioner" with the slot value "open" exceeds the second threshold, so the slot { "slot_name": "ac", "slot_value": "open" } is selected and natural language generation is finally performed, asking the user whether the air conditioner should be turned on, for example: "Do you need the air conditioner turned on?"
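The threshold screening in this worked example can be sketched directly. The candidate pairs are matched with the predictive values above; the value labels for slot names other than "air conditioner" are assumptions, since table 2 is not reproduced here.

```python
# Candidate (slot name, slot value) pairs paired with their predictive
# values; only pairs above the second threshold survive the screening.
candidates = [
    ("air_conditioner", "open"), ("air_conditioner", "close"),
    ("screen", "open"), ("screen", "close"),
    ("temperature", "up"), ("temperature", "down"),
    ("volume", "up"), ("volume", "down"),
    ("default", "yes"), ("default", "no"),
]
scores = [0.53, 0.2, 0.04, 0.08, 0.03, 0.02, 0.01, 0.02, 0.05, 0.02]
SECOND_THRESHOLD = 0.5

selected = [{"slot_name": name, "slot_value": value}
            for (name, value), s in zip(candidates, scores)
            if s > SECOND_THRESHOLD]
print(selected)  # [{'slot_name': 'air_conditioner', 'slot_value': 'open'}]
```

With a lower threshold such as 0.1, more than one pair could survive, which corresponds to the multi-slot case discussed next in the text.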
In some other examples, if the maximum predictive values of the slot values for two or more pieces of slot information in table 2 exceed the second threshold, natural language generation is performed using all of the eligible slot information at once. For example, for the user voice request "I am sweating and annoyed", the slot name "air conditioner" with its slot value "on" and the slot name "music" with its slot value "play" may both have predictive values greater than the second threshold. A reply can then be generated such as: "It seems you are hot and upset. Would you like to turn on the air conditioner and play some music?"
Therefore, slot information whose predictive value exceeds a certain threshold can be selected from the slot prediction result and used for natural language generation, so as to feed back an emotion-soothing strategy and soothe the user's negative emotion.
Referring to fig. 10, after step 03, the voice interaction method includes:
07: if the predicted value of the current emotion of the user is larger than a third threshold value, natural language generation is performed according to the predicted result and intention information of the current emotion of the user;
08: feeding back an emotion pacifying strategy in the vehicle cabin according to the result generated by the natural language;
09: and receiving a confirmation request of a user in the vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
The processor is used for generating natural language according to the prediction result and the intention information of the current emotion of the user if the prediction value of the current emotion of the user is larger than a third threshold value, feeding back an emotion pacifying strategy in a vehicle cabin according to the result of the natural language generation, receiving a confirmation request of the user for the emotion pacifying strategy in the vehicle cabin, and controlling the vehicle to execute the emotion pacifying strategy so as to complete voice interaction.
Specifically, when the predictive value of the user's current emotion is greater than the third threshold, natural language generation is performed according to the prediction result and the intention information. The third threshold may be set to a larger value between 0 and 1, generally 0.8 or above, for example 0.8 or 0.85. A higher third threshold means that only negative emotions with a high risk coefficient are soothed, without disturbing the user frequently, thereby guaranteeing driving safety to the greatest extent.
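The third-threshold check is a simple comparison, sketched below with the 0.8 value suggested in the text; the function name is an assumption.

```python
THIRD_THRESHOLD = 0.8  # value suggested in the text

def needs_soothing(emotion_score: float) -> bool:
    """Return True when the predicted negative emotion is risky enough
    that an emotion-soothing strategy should be generated."""
    return emotion_score > THIRD_THRESHOLD

print(needs_soothing(0.86))  # True: a strongly negative emotion
print(needs_soothing(0.42))  # False: no soothing needed
```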
In practice, a poor experience during interaction may lead users to blame the voice assistant. In one example, after the user issues a voice request for navigation to a supermarket and receives no feedback from the voice assistant, the user issues another voice request: "Aiya, you are really bad". The predictive value of the user's current emotion is then obtained from the contextual voice features as follows:
{"sequence_id": "2",
"location": "0",
"query": "Aiya, you are really bad",
"domain": "system",
"intent": "system_error_feedback",
"slot": [],
"emotion_score": "0.86", "emo_embedding": [0.64, 0.74, 0.84, 0.34]}
The predictive value of the user's current emotion is 0.86, indicating a strong negative emotion. With the third threshold set to 0.8, the predictive value exceeds the third threshold and emotion soothing is needed. Natural language generation is performed according to the current emotion prediction result and the intention information, producing for example: "Dear owner, your current emotion seems abnormal. Among millions of roads, safety comes first. Would you like some relaxing music?"
Further, feeding back the natural language generation result to the user, so as to provide the emotion-soothing strategy to the user in the vehicle cabin, can be realized by voice broadcast, text reminder and the like; the specific delivery mode is not limited herein.
After the emotion-soothing strategy is fed back to the user in the vehicle cabin, the user can confirm it according to their own needs. If the user replies with an affirmative voice request such as "OK", confirming the strategy, the vehicle can be controlled to execute the corresponding emotion-soothing strategy. In the above example, after a confirmation request for "soothing music" is received from the user in the vehicle cabin, the in-vehicle music player can be controlled to play soothing music, calming the user's emotion and completing the voice interaction.
In particular, if the user refuses the emotion-soothing strategy through a voice request, the user in the vehicle cabin can be warned by voice broadcast or the like, and driving safety can be confirmed with the user. Once the user confirms that the vehicle is running safely, the warning can be ended.
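The confirm/refuse handling above amounts to a small routing step. The sketch below is an assumption for illustration: the keyword lists and state names are invented, and a real system would use natural language understanding rather than keyword matching.

```python
# Route the user's reply to a proposed emotion-soothing strategy
# (illustrative keywords and state names).
POSITIVE = {"ok", "yes", "sure"}
NEGATIVE = {"no", "refuse"}

def handle_strategy_reply(reply: str) -> str:
    """Map an affirmative reply to execution, a refusal to a safety
    warning, and anything else to a re-prompt."""
    word = reply.strip().lower()
    if word in POSITIVE:
        return "execute_strategy"         # e.g. play soothing music
    if word in NEGATIVE:
        return "warn_and_confirm_safety"  # voice-broadcast a safety reminder
    return "ask_again"

print(handle_strategy_reply("OK"))  # execute_strategy
```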
Therefore, natural language generation can be performed according to the predictive value of the user's current emotion, an emotion-soothing strategy can be fed back to the user, and, after the user confirms it, the vehicle can be controlled to execute the corresponding strategy, soothing the user's negative emotion, ensuring driving safety and improving the interaction experience.
The computer readable storage medium of the present application stores a computer program which, when executed by one or more processors, implements the methods described above.
In the description of the present specification, reference to the terms "above," "specifically," "particularly," "further," and the like, means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application, in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.
While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the present application, and that variations, modifications, alternatives, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the present application.

Claims (9)

1. A method of voice interaction, comprising:
receiving a user voice request in a vehicle cabin;
natural language understanding is carried out on the user voice request, and intention information is obtained;
acquiring a context voice feature of the user voice request and a vehicle feature corresponding to the context voice feature;
predicting the current emotion of the user according to the intention information, the user voice request, the contextual voice features and the corresponding vehicle features;
if the predictive value of the current emotion of the user is larger than a first threshold value, constructing a predictive data set according to the intention information by utilizing each standard slot information and the user voice request;
performing slot prediction on the user voice request according to the prediction data set;
and finishing voice interaction according to the slot prediction result.
2. The voice interaction method according to claim 1, wherein the obtaining the contextual voice feature of the user voice request and the vehicle feature corresponding to the contextual voice feature comprises:
acquiring a historical voice request, a natural language understanding result corresponding to the historical voice request and a historical prediction result of the current emotion of a user;
and constructing a context sequence element according to the historical voice request, the corresponding natural language understanding result and the historical prediction result to obtain the context voice feature.
3. The method of claim 2, wherein said constructing a context sequence element from said historical voice request and corresponding said natural language understanding result and said historical prediction result to obtain said contextual voice feature comprises:
determining an interaction order, a user position and a voice text of the historical voice request;
determining domain information, intention information and slot position information of the historical voice request according to the natural language understanding result;
determining emotion scores and emotion vectors of the historical voice requests according to the historical prediction results;
and constructing the context sequence element according to the interaction sequence, the user position, the voice text, the domain information, the intention information, the slot position information, the emotion score and the emotion vector corresponding to the historical voice request to obtain the context voice characteristic.
4. The voice interaction method according to claim 2, wherein the obtaining the contextual voice feature of the user voice request and the vehicle feature corresponding to the contextual voice feature comprises:
and acquiring vehicle cabin state information, driving information, communication information, entertainment information and driving safety information corresponding to the historical voice request, and constructing vehicle sequence elements to obtain the vehicle characteristics.
5. The voice interaction method according to claim 1, wherein the completing the voice interaction according to the slot prediction result comprises:
natural language generation is carried out according to the slot position prediction result;
feeding back an emotion pacifying strategy in the vehicle cabin according to a natural language generation result;
and receiving a confirmation request of a user in a vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
6. The voice interaction method according to claim 5, wherein the generating natural language according to the slot prediction result comprises:
and generating natural language according to the slot information corresponding to the maximum predictive value larger than the second threshold value in the slot predictive result.
7. The voice interaction method according to claim 1, wherein after the step of predicting a current emotion of a user based on the intention information, the user voice request, the contextual voice feature and the corresponding vehicle feature, the voice interaction method comprises:
if the predicted value of the current emotion of the user is larger than a third threshold value, natural language generation is performed according to the predicted result of the current emotion of the user and the intention information;
feeding back an emotion pacifying strategy in the vehicle cabin according to the natural language generated result;
and receiving a confirmation request of a user in a vehicle cabin for the emotion soothing strategy, and controlling the vehicle to execute the emotion soothing strategy so as to complete voice interaction.
8. A vehicle comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, implements the method of any of claims 1-7.
9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by one or more processors, implements the method according to any of claims 1-7.
CN202310294793.1A 2023-03-22 2023-03-22 Voice interaction method, vehicle and computer readable storage medium Active CN116030811B (en)

Publications (2)

Publication Number Publication Date
CN116030811A CN116030811A (en) 2023-04-28
CN116030811B true CN116030811B (en) 2023-06-30



Also Published As

Publication number Publication date
CN116030811A (en) 2023-04-28

Similar Documents

Publication Publication Date Title
CN109410927B (en) Voice recognition method, device and system combining offline command word and cloud analysis
EP3482344B1 (en) Portable personalization
US11685386B2 (en) System and method for determining a change of a customary vehicle driver
TWI626615B (en) Information providing device and non-transitory computer readable medium storing information providing program
US9272714B2 (en) Driver behavior based vehicle application recommendation
CN104731854B (en) Speech recognition inquiry response system
CN104816687B (en) The system and method for automation driver's action in vehicle
Vögel et al. Emotion-awareness for intelligent vehicle assistants: A research agenda
US8135506B2 (en) Methods and systems for providing vehicle information
US7860621B2 (en) Method for operating a motor vehicle with a large number of function systems
WO2018081020A1 (en) Computerized domain expert
CN110877586B (en) Method for operating a virtual assistant of a motor vehicle and corresponding backend system
CN115457959B (en) Voice interaction method, server and computer readable storage medium
US20220198151A1 (en) Dialogue system, a vehicle having the same, and a method of controlling a dialogue system
CN116030811B (en) Voice interaction method, vehicle and computer readable storage medium
US20230317072A1 (en) Method of processing dialogue, user terminal, and dialogue system
CN115859219A (en) Multi-modal interaction method, device, equipment and storage medium
CN113791841A (en) Execution instruction determining method, device, equipment and storage medium
CN116168704B (en) Voice interaction guiding method, device, equipment, medium and vehicle
US11794676B1 (en) Computing systems and methods for generating user-specific automated vehicle actions using artificial intelligence
US20230223039A1 (en) Emotion modeling method and apparatus thereof
US20220355664A1 (en) Vehicle having voice recognition system and method of controlling the same
Du et al. Towards Proactive Interactions for In-Vehicle Conversational Assistants Utilizing Large Language Models
US20210303263A1 (en) Dialogue system and vehicle having the same, and method of controlling dialogue system
US20240157896A1 (en) Vehicle system and method for adjusting interior control settings based on driver emotion and environmental context

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant