CN116825107A - Voice interaction method and device, electronic equipment and storage medium - Google Patents

Voice interaction method and device, electronic equipment and storage medium

Info

Publication number
CN116825107A
Authority
CN
China
Prior art keywords
voice
keywords
displaying
information
voice information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311052939.8A
Other languages
Chinese (zh)
Other versions
CN116825107B (en)
Inventor
张茜
徐智鹏
刘博庭
季泓苓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jidu Technology Co Ltd
Original Assignee
Beijing Jidu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jidu Technology Co Ltd filed Critical Beijing Jidu Technology Co Ltd
Priority to CN202311052939.8A priority Critical patent/CN116825107B/en
Publication of CN116825107A publication Critical patent/CN116825107A/en
Application granted granted Critical
Publication of CN116825107B publication Critical patent/CN116825107B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to drivers or passengers
    • B60W2040/089 Driver voice
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00 Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W50/08 Interaction between the driver and the control system
    • B60W50/14 Means for informing the driver, warning the driver or prompting a driver intervention
    • B60W2050/146 Display means
    • B PERFORMING OPERATIONS; TRANSPORTING
    • B60 VEHICLES IN GENERAL
    • B60W CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 Input parameters relating to occupants
    • B60W2540/21 Voice
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Automation & Control Theory (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a voice interaction method, a voice interaction device, an electronic device, and a storage medium, where the method includes: displaying a voice avatar, the voice avatar indicating that the vehicle is currently in a voice-function wake-up state; acquiring first voice information, and determining a plurality of keywords corresponding to the first voice information, where the keywords are used to guide a target user in initiating second voice information; and displaying a voice response interface, where the voice response interface is used to display the plurality of keywords surrounding the voice avatar. The embodiments of the disclosure make voice interaction more vivid, provide reference keywords for issuing voice instructions during the interaction, and help improve the efficiency of voice interaction.

Description

Voice interaction method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of vehicles, and in particular to a voice interaction method, a voice interaction device, an electronic device, and a storage medium.
Background
To let the user access relevant vehicle service functions without affecting driving, a vehicle may provide voice interaction functions. The user can issue instructions to the vehicle by voice; based on a voice instruction, the vehicle controls the relevant functional components to execute it and feeds back the execution result by voice.
However, existing vehicles lack vividness when interacting with the user by voice, so the interaction experience is poor and the interaction efficiency is low.
Disclosure of Invention
The embodiment of the disclosure at least provides a voice interaction method, a voice interaction device, electronic equipment and a storage medium.
In a first aspect, an embodiment of the present disclosure provides a voice interaction method, including:
displaying a voice avatar, the voice avatar indicating that the vehicle is currently in a voice-function wake-up state;
acquiring first voice information, and determining a plurality of keywords corresponding to the first voice information, where the keywords are used to guide a target user in initiating second voice information;
and displaying a voice response interface, where the voice response interface is used to display a plurality of the keywords surrounding the voice avatar.
With this implementation, the voice avatar is displayed as the object the user interacts with, making the interaction with the vehicle more vivid. The voice response interface is displayed based on the user's first voice information and presents, around the voice avatar, a plurality of keywords the user can refer to when initiating second voice information. This further enlivens the interaction while providing guidance for the next voice question; it reduces, to some extent, the effort the user spends working out what to ask, and thus improves both the efficiency of voice interaction and the interaction experience with the vehicle.
In an alternative embodiment, the voice avatar is in a rotating state during display, and the plurality of keywords rotate synchronously around the voice avatar.
With this embodiment, the keywords are presented in rotation, which helps attract the user's attention.
In an alternative embodiment, the displaying of a plurality of the keywords surrounding the voice avatar includes:
determining at least one surrounding path according to the number of keywords, and displaying the keywords around the voice avatar along the at least one surrounding path.
With this embodiment, when there are many keywords, multiple surrounding paths can be displayed, so that the keywords rotate along different surrounding paths; this makes full use of the display space around the voice avatar while improving the visual experience.
In an alternative embodiment, the displaying of a voice response interface includes:
displaying a plurality of the keywords surrounding the voice avatar in a first region of the voice response interface;
and displaying a first answer result corresponding to the first voice information in a second area of the voice response interface.
With this embodiment, the voice response interface is divided into a first area and a second area: the keywords surrounding the voice avatar are displayed in the first area, and the answer result corresponding to the voice information initiated by the user is displayed in the second area, so that relevant guidance information and the relevant voice answer result can be presented to the user at the same time.
In addition, the keyword among the plurality of keywords that matches the answer result in the second area can be placed in a highlighted state; for example, a first keyword matching the first voice information initiated by the user may be included among the keywords and highlighted. This lets the user see which keyword, i.e. which question, the answer result in the second area corresponds to, making the answer result easier to understand.
In an optional implementation manner, the displaying, in the second area of the voice response interface, the first answer result corresponding to the first voice information includes:
displaying, in the case where the voice response interface is displayed on a first display screen corresponding to a front seat in the vehicle, the first answer result corresponding to the first voice information in the second area of the voice response interface.
In this embodiment, where the vehicle has multiple rows of seats, the display screen corresponding to the front row is generally larger; therefore, a second area for displaying the voice answer result may be provided on the first display screen corresponding to the front seats, while the second display screen corresponding to the rear seats may display only the voice avatar and the surrounding keywords.
In a possible implementation manner, the displaying, in the second area of the voice response interface, the first answer result corresponding to the first voice information includes:
displaying a plurality of first answer results corresponding to the first voice information in the second area in a matrix form; or
displaying the plurality of first answer results corresponding to the first voice information in a first column of the second area, and displaying the geographic location information corresponding to each first answer result in a second column of the second area, with the geographic location information on the same row as its first answer result.
This embodiment provides two display modes for answer results: the first can show as many answer results as possible for the user to choose from; the second shows each answer result together with its geographic location information, assisting the user in making a further selection.
In one possible embodiment, before the displaying of the plurality of keywords surrounding the voice avatar, the method further includes:
determining that the current scenario is a multi-user interaction scenario.
In this embodiment, the plurality of surrounding keywords may be displayed specifically in the case of a multi-user interaction scenario (i.e. a group-chat scenario); that is, this presentation form of the voice interaction interface can be specific to multi-user interaction. In a multi-user interaction scenario, displaying multiple surrounding keywords lets different users each select keywords and initiate further voice interaction, and answer results for the users' joint interaction can be displayed based on the voice information of the different users.
The current multi-user interaction scenario can be determined in several ways. In one way, it is determined from keyword information in the initially input first voice information; for example, the first voice information contains keywords such as "we" or "together" that indicate multiple users are currently participating in the conversation. Alternatively, behavior recognition may be performed on an in-cabin environment image: by recognizing user behavior in the image, it can be determined whether multiple users are currently engaged in communication behaviors such as whispering to each other or looking at each other.
In one possible implementation, the first voice information includes integrated voice information determined based on voice information input by a plurality of users.
That is, the first voice information may be the integrated semantic information obtained after multiple users each input voice information, so that the multi-user voice input can be understood as a whole and an answer result can be provided that meets the users' common requirement, or the requirement they reach after negotiation.
Here, a corresponding answer result may be provided on the interface each time one user inputs voice information; after the next user inputs voice information, the answer result is updated by jointly understanding the earlier user's voice information and the later user's voice information (the later input may supplement the earlier user's voice question, or may negate it and pose a new question).
In a possible implementation manner, the first voice information matches a first keyword among the plurality of keywords, and the first keyword is in a highlighted state; the method further includes:
acquiring second voice information; in response to the second voice information matching a second keyword among the plurality of keywords, canceling the highlighted state of the first keyword, placing the second keyword in the highlighted state, and rotating the second keyword to a position in front of the voice avatar for display; and replacing the displayed first answer result with a second answer result matching the second keyword.
In this embodiment, after the user's voice information matches a new keyword, the highlighted keyword is switched and the answer result is updated synchronously, so that the user can clearly see the updated question keyword, which makes the answer result easier to understand.
In a possible embodiment, the method further comprises:
acquiring third voice information;
and updating the voice response interface in response to the third voice information not matching the plurality of keywords, the updated voice response interface being used to display a plurality of updated keywords surrounding the voice avatar.
In this embodiment, when the voice information newly input by the user does not match the keywords in the voice response interface, this indicates that the user has switched question direction; the keywords are therefore updated for the newly input voice information, so as to provide accurate guidance for the user's new question direction.
On the basis of the above embodiment, after the updating of the voice response interface, the method further includes:
broadcasting an update prompt by voice in the case where the third voice information is command-type voice information, the update prompt indicating the currently updated keywords.
In implementation, a voice instruction issued by the user may be command-type (stating a specific command requirement) or trend-type (content under discussion among users); for command-type voice information, the updated keywords can be broadcast by voice so as to respond to the user's command.
In a second aspect, an embodiment of the present disclosure further provides a voice interaction device, including:
a first display module, configured to display a voice avatar, the voice avatar indicating that the vehicle is currently in a voice-function wake-up state;
an acquisition module, configured to acquire first voice information and determine a plurality of keywords corresponding to the first voice information, where the keywords are used to guide a target user in initiating second voice information;
and a second display module, configured to display a voice response interface, where the voice response interface is used to display a plurality of the keywords surrounding the voice avatar.
In one possible embodiment, the voice avatar is in a rotating state during display, and the plurality of keywords rotate synchronously around the voice avatar.
In one possible implementation manner, the second display module is specifically configured to:
determine at least one surrounding path according to the number of keywords, and display the keywords around the voice avatar along the at least one surrounding path.
In one possible implementation manner, the second display module is specifically configured to:
displaying a plurality of the keywords surrounding the voice avatar in a first region of the voice response interface;
and displaying a first answer result corresponding to the first voice information in a second area of the voice response interface.
In one possible implementation manner, the second display module is specifically configured to:
display, in the case where the voice response interface is displayed on a first display screen corresponding to a front seat in the vehicle, the first answer result corresponding to the first voice information in the second area of the voice response interface.
In one possible implementation manner, the second display module is specifically configured to:
display a plurality of first answer results corresponding to the first voice information in the second area in a matrix form; or
display the plurality of first answer results corresponding to the first voice information in a first column of the second area, and display the geographic location information corresponding to each first answer result in a second column of the second area, with the geographic location information on the same row as its first answer result.
In one possible implementation manner, the second display module is specifically configured to:
determine, before the plurality of keywords surrounding the voice avatar are displayed, that the current scenario is a multi-user interaction scenario.
In one possible implementation, the first voice information includes integrated voice information determined based on voice information input by a plurality of users.
In one possible implementation, the first voice information matches a first keyword of the plurality of keywords, the first keyword being in a highlighted state;
the acquisition module is further configured to: acquire second voice information;
the second display module is further configured to: in response to the second voice information matching a second keyword among the plurality of keywords, cancel the highlighted state of the first keyword, place the second keyword in the highlighted state, and rotate the second keyword to a position in front of the voice avatar for display; and replace the displayed first answer result with a second answer result matching the second keyword.
In one possible implementation, the acquisition module is further configured to: acquire third voice information;
the second display module is further configured to: update the voice response interface in response to the third voice information not matching the plurality of keywords, the updated voice response interface being used to display a plurality of updated keywords surrounding the voice avatar.
In one possible embodiment, the apparatus further comprises:
a voice feedback module, configured to broadcast an update prompt by voice after the second display module updates the voice response interface, in the case where the third voice information is command-type voice information, the update prompt indicating the currently updated keywords.
In a third aspect, embodiments of the present disclosure further provide a computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication via the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of the first aspect, or any of the possible implementations of the first aspect.
In a fourth aspect, the presently disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect, or any of the possible implementations of the first aspect.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments are briefly described below; these drawings are incorporated in and constitute a part of the specification, show embodiments consistent with the present disclosure, and together with the description serve to explain the technical solutions of the present disclosure. It is to be understood that the following drawings show only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope; for a person of ordinary skill in the art, other related drawings may be obtained from them without inventive effort.
FIG. 1 illustrates a flow chart of a method of voice interaction provided by an embodiment of the present disclosure;
fig. 2 is a schematic diagram of a voice avatar displayed on the in-vehicle interface in the voice interaction method provided by the embodiment of the present disclosure;
fig. 3a is a schematic diagram illustrating a plurality of keywords displayed around a voice avatar in a voice interaction method according to an embodiment of the present disclosure;
fig. 3b is a schematic diagram illustrating keywords displayed around a voice avatar through a plurality of surrounding paths in the voice interaction method according to the embodiment of the present disclosure;
Fig. 4a is a schematic diagram showing answer results in a matrix form in the voice interaction method provided by the embodiment of the disclosure;
fig. 4b is a schematic diagram showing answer results in a double-row form in the voice interaction method provided by the embodiment of the disclosure;
fig. 5a is a schematic diagram illustrating a voice response interface displayed on a second display screen corresponding to a rear seat in the voice interaction method according to the embodiment of the present disclosure;
fig. 5b is a schematic diagram illustrating a voice response interface displayed on a first display screen and a second display screen simultaneously in a voice interaction method according to an embodiment of the disclosure;
FIG. 5c is a schematic diagram illustrating the relative positions of a first display screen and a second display screen within a vehicle;
FIG. 6 is a schematic diagram of switching the keyword hit by a voice input;
fig. 7 is a schematic diagram illustrating switching of a plurality of keywords surrounding a voice avatar in a voice interaction method according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a voice interaction device according to an embodiment of the present disclosure;
fig. 9 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
Generally, an intelligent vehicle provides voice interaction services for its users, but the interface display for voice interaction usually lacks vividness, typically showing only a wave effect representing the voice-receiving state together with the recognized voice information, so the user's interaction experience is poor. In addition, the user has to work out the question content during voice interaction, so interaction efficiency is low.
The embodiments of the present disclosure aim to provide users with a more vivid voice interaction service with a better experience, one that also adapts well to in-vehicle scenarios with multi-user interaction.
The term "and/or" herein merely describes an association relationship, indicating that three relationships may exist; for example, A and/or B may mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, including at least one of A, B, and C may mean including any one or more elements selected from the set consisting of A, B, and C.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
To facilitate understanding of the embodiments of the present disclosure, the voice interaction method disclosed herein is first described in detail. The execution body of the voice interaction method provided by the embodiments of the present disclosure is a processing device with certain computing capability, such as a vehicle-mounted terminal, which may be integrated in a domain controller; the processing device is connected to a display screen for displaying the relevant interaction interface. In some possible implementations, the voice interaction method may be implemented by a processor in the processing device invoking computer-readable instructions stored in a memory.
The following examples illustrate the voice interaction method provided by the embodiments of the present disclosure.
Example 1
Referring to fig. 1, a flowchart of a voice interaction method according to an embodiment of the disclosure is shown; the method includes steps S101 to S103:
s101: a voice avatar is displayed, the voice avatar being used to indicate that the vehicle is currently in a voice function awake state.
In a specific implementation, a user can speak a wake-up word in the vehicle to wake up the in-vehicle voice function; the wake-up word may include the name of the voice avatar or another specific wake-up word. After the voice function is awakened, the in-vehicle interface displays the voice avatar, which thus indicates that the vehicle is currently in the voice-function wake-up state.
The voice avatar may use a preset image, which may be a virtual human figure, a virtual animal, or the like, or an object of a preset shape, such as a hexahedron.
Fig. 2 is a schematic diagram of a voice avatar displayed on the in-vehicle interface; here the voice avatar is a hexahedron. The voice avatar may be presented in any interactive area of the in-vehicle interface. Special effects may be rendered on it, such as a dynamic change inside the hexahedron, e.g. a breathing-light brightness-switching effect; the voice avatar may also be in a rotating state during display. Showing internal dynamic effects while rotating conveys interactivity and indicates that the system is waiting for the user's further voice input. In addition, the voice avatar can take different forms across the stages from initial loading and presentation, through the voice interaction, to the listening stage after an interaction ends, further refining the voice interaction experience.
During voice interaction, the voice avatar may also change its display form as the interaction state changes; for example, at different stages such as waiting for voice input, receiving voice information, and providing answer results, the avatar can switch its display form accordingly, which reminds the user of the current stage and enriches the interaction experience.
S102: first voice information is acquired, and a plurality of keywords corresponding to the first voice information are determined, where the keywords are used to guide a target user in initiating second voice information.
After the first voice information of the user is obtained, semantic analysis may be performed on it to determine keywords representing its key semantics. In this embodiment, there are multiple ways to determine the keywords. In one way, the semantic text of the first voice information is segmented into multiple tokens, the relevance of each token to the semantic information of the first voice information is determined, target tokens whose relevance exceeds a set relevance threshold are selected, and guide keywords matching the target tokens are then determined, as the plurality of keywords, from a guide-word library (which contains standardized keywords). Alternatively, keyword determination can be completed by invoking a pre-trained neural network model; the present disclosure does not limit the method of determining keywords from the first voice information, as long as it can be implemented. In addition, in this embodiment, related question keywords may also be recommended according to the key semantics of the first voice information. Therefore, the plurality of keywords corresponding to the first voice information may include both the keywords representing its key semantics and other recommended keywords.
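As a concrete illustration of the flow just described, the following is a minimal sketch; the segmenter, the relevance scorer, the threshold value, and the guide-word library lookup are all assumptions, since the disclosure does not prescribe any particular implementation.

```python
# Minimal sketch of the keyword-determination step (all names are assumptions).
from typing import Callable, Dict, List

RELEVANCE_THRESHOLD = 0.6  # assumed; the disclosure only says "a set relevance threshold"

def determine_keywords(asr_text: str,
                       segment: Callable[[str], List[str]],     # a word segmenter
                       relevance: Callable[[str, str], float],  # token-vs-text relevance in [0, 1]
                       guide_word_library: Dict[str, str]) -> List[str]:
    """Map high-relevance tokens of the recognized text to standardized guide keywords."""
    tokens = segment(asr_text)
    # Keep only tokens strongly related to the overall semantics of the utterance.
    targets = [t for t in tokens if relevance(t, asr_text) >= RELEVANCE_THRESHOLD]
    keywords: List[str] = []
    for t in targets:
        guide = guide_word_library.get(t)  # standardized keyword, if the library has one
        if guide and guide not in keywords:
            keywords.append(guide)
    return keywords
```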
In a specific implementation, recommended keywords may be determined over multiple screening dimensions. For example, if the first voice information initiated by the user is "let's go to a coffee shop together", the keyword representing its key semantics is "coffee shop". The object type associated with the first voice information can then be determined from these key semantics (a coffee shop belongs to beverage venues or, at a higher level, dining venues), and screening dimensions such as distance, rating, and price can be derived from that object type; recommended keywords such as "closer distance", "rated above 4 points", and "cheaper price" may then be offered for the user's further screening.
In some cases, if no semantically valid keyword can be extracted from the voice information input by the user, the current voice information may be considered invalid and left unprocessed.
In the embodiment of the present disclosure, the surrounding presentation of multiple keywords may be specific to multi-user interaction scenarios. That is, in a single-user scenario the surrounding form need not be used; for example, the recommended keywords may be presented in a static arrangement. In a multi-user interaction scenario, to reflect the interaction among multiple users, the keywords are displayed surrounding the voice avatar and rotating dynamically.
In implementation, a number of ways may be employed to determine whether the current scenario is a multi-user interaction scenario: for example, scene recognition via keywords in the voice information issued by the user, via image information, or via sensor signals. These are described separately below.
Recognition mode 1: keyword recognition.
In this mode, the current scenario is determined to be a multi-user interaction scenario in response to keyword information in the voice information matching target keyword information; the target keyword information consists of keywords indicating a multi-user interaction requirement.
For example, on receiving the voice information "we'll go drink coffee together", the voice information is converted into text through semantic understanding, and the converted text is segmented into individual keywords; according to a preset keyword library for multi-user interaction scenarios, the target keywords "we" and "together" among them are determined to match the multi-user interaction scenario, so the current interaction scenario is determined to be a multi-user interaction scenario.
Recognition mode 2: image recognition.
In this mode, the current scenario is determined to be a multi-user interaction scenario when multi-user interaction behavior inside the vehicle is recognized from an acquired in-cabin scene image.
For example, an image of the vehicle interior can be obtained through an in-cabin camera, and the number of users in the image and the interaction behaviors among them can be determined by image recognition; for instance, if several users are recognized in the vehicle and behaviors such as leaning heads together to whisper, making eye contact, or interacting through gestures exist among them, the current interaction scenario is determined to be a multi-user interaction scenario.
Recognition mode 3: sensor-signal recognition. In this mode, for example, the signals fed back by seat sensors can confirm that users are present on multiple seats, allowing a preliminary judgment that the current scenario is a multi-user interaction mode.
After the interaction-scene recognition result determines that the current interaction scenario is a multi-user interaction scenario, the system can enter a group-chat mode and display the effect of multiple keywords rotating around the voice avatar.
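The three recognition modes can also be combined. The sketch below is one hypothetical fusion of the three signals, not logic prescribed by the disclosure; the cue-word set and the helper inputs are invented for illustration.

```python
# Hypothetical fusion of the three scene-recognition signals described above.
MULTI_USER_CUES = {"we", "together", "let's"}  # translated cue words; illustrative only

def is_multi_user_scene(utterance_tokens: set,
                        interacting_users_seen: bool,
                        occupied_seats: int) -> bool:
    if utterance_tokens & MULTI_USER_CUES:   # mode 1: cue words in the utterance
        return True
    if interacting_users_seen:               # mode 2: in-cabin image shows interaction
        return True
    return occupied_seats >= 2               # mode 3: seat sensors (preliminary judgment)
```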
In the multi-user interaction scenario, the first voice information may include integrated voice information determined based on voice information input by a plurality of users.
That is, the first voice information may be the combined voice result after multiple users each input voice information, so that the multi-user voice input content can be understood as a whole, and an answer result can be provided that meets either the users' common requirement (when the utterances are progressive questions, e.g. first asking for a coffee shop and then asking for a nearby one among the results) or the requirement reached after negotiation (e.g. after one user says "go to a coffee shop" and another says "let's have hot pot instead", hot pot is taken as the current requirement after negotiation).
Here, a corresponding answer result may be provided on the interface each time one user inputs voice information; after the next user inputs voice information, the answer result is updated by jointly understanding the earlier user's voice information and the later user's voice information (the later input may supplement the earlier user's voice question, or may negate it and pose a new question).
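A minimal sketch of this per-speaker accumulation is given below, assuming the semantic result of each utterance has already been reduced to a topic plus constraints; the field names and the merge rule stand in for a real multi-turn semantic-fusion model and are assumptions, not the disclosure's method.

```python
# Sketch: integrate successive users' utterances into one query (names assumed).
def merge_semantics(previous_query: dict, new_utterance: dict) -> dict:
    """A later utterance may supplement the earlier question (add constraints)
    or negate it (raise a new topic), as described above."""
    new_topic = new_utterance.get("topic")
    if new_topic and new_topic != previous_query.get("topic"):
        # Negation / new question: restart from the new topic.
        return {"topic": new_topic,
                "constraints": dict(new_utterance.get("constraints", {}))}
    # Supplement: keep the topic and accumulate the new constraints.
    merged = {"topic": previous_query.get("topic"),
              "constraints": dict(previous_query.get("constraints", {}))}
    merged["constraints"].update(new_utterance.get("constraints", {}))
    return merged

# Example: user A asks for a coffee shop, user B adds "somewhere close".
query = {"topic": "coffee shop", "constraints": {}}
query = merge_semantics(query, {"constraints": {"distance": "near"}})
```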
S103: a voice response interface is displayed, where the voice response interface is used to display a plurality of the keywords surrounding the voice avatar.
Here, presenting the voice response interface includes displaying the voice avatar and a plurality of keywords surrounding it, where surrounding means that the keywords encircle the voice avatar along a preset surrounding path. In a specific implementation, the diameter of the surrounding path can be determined according to the number of keywords to be displayed.
As explained in the description of the voice avatar above, the avatar may be in a rotating state during display; in that case, the plurality of keywords corresponding to the first voice information may rotate synchronously around it.
Fig. 3a is a schematic diagram of a plurality of keywords surrounding the voice avatar. The keywords presented around it include the keyword "coffee shop", which represents the key semantics of the first voice information (and may be in a selected state at this moment), plus several recommended keywords: "closer distance", "rated above 4 points", and "cheaper price"; different keywords correspond to different screening dimensions. Here "coffee shop" is the currently hit keyword and is in a highlighted state. When voice information newly input by the user hits another keyword, the highlight on the previously hit keyword can be cancelled and the newly hit keyword placed in a highlighted state (for example, by color highlighting); in addition, the newly hit keyword can be rotated to the front of the voice avatar to better attract the user's attention.
In this way, the keywords are presented in rotation, which avoids the problem of some keywords being positioned behind the voice avatar where they cannot be displayed clearly; the rotating presentation also attracts the user's attention better and thus improves the interaction experience.
In addition, when the keywords surrounding the voice avatar are displayed, there may be one or more surrounding paths; specifically, at least one surrounding path may be determined according to the number of keywords, and the keywords are displayed around the voice avatar along the at least one surrounding path. Each keyword is shown on only one surrounding path.
Thus, when there are many keywords, multiple surrounding paths can be displayed so that the keywords rotate along different paths, making full use of the display space around the voice avatar while improving the visual experience.
Fig. 3b is a schematic diagram of keywords displayed along multiple surrounding paths: the keywords "coffee shop", "closer distance", "rated above 4 points", and "cheaper price" are shown on one surrounding path, while the keywords "chosen by more users", "longer business hours", and "more coffee varieties" are shown on another surrounding path.
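One possible layout computation consistent with this description is sketched below; the per-path capacity and the angular speed are assumed values, not taken from the disclosure.

```python
# Sketch: split keywords across surround paths and rotate them synchronously.
import math

MAX_PER_PATH = 4  # assumed capacity of one surrounding path

def layout_keywords(keywords, t, angular_speed=0.5):
    """Yield (keyword, path_index, angle): keywords on one path are evenly
    spaced, and all paths rotate synchronously (angle in radians at time t)."""
    paths = [keywords[i:i + MAX_PER_PATH]
             for i in range(0, len(keywords), MAX_PER_PATH)]
    for p, ring in enumerate(paths):
        step = 2 * math.pi / len(ring)
        for k, word in enumerate(ring):
            yield word, p, (angular_speed * t + k * step) % (2 * math.pi)
```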
In a specific implementation, when the voice response interface is displayed, in addition to displaying the voice avatar in the first area and a plurality of keywords surrounding the voice avatar, a first answer result corresponding to the first voice information may be displayed in the second area of the voice response interface.
That is, the voice avatar and the plurality of keywords rotating around it are displayed in a first area of the voice response interface, which may be the upper area of the interface; the first answer result may be displayed in a second area, the lower area of the interface, adjacent to the first area in which the voice avatar is displayed.
In the embodiment of the disclosure, when multiple answer results are displayed, a voice prompt may accompany them, for example the broadcast "I found these; which one would you like to go to?". That is, results can be fed back to the user by combining voice with the interface display, the voice serving to remind the user to check the fed-back results. In implementation, besides reminder information, the keywords displayed on the interface, introductions to the answer results, and the like can also be broadcast by voice.
In a specific implementation, the plurality of first answer results may be displayed in the voice response interface in a set arrangement, which may depend on the number of answer results: when there are many, a matrix arrangement can present more answer results; when there are relatively few, one column can present the answer results while another column presents the corresponding geographic location information side by side with them.
Fig. 4a is a schematic diagram of displaying answer results in matrix form: a plurality of first answer results corresponding to the first voice information are displayed in the second area of the voice response interface in a matrix (two rows by three columns).
Fig. 4b is a schematic diagram of displaying answer results in two-column form: the plurality of first answer results corresponding to the first voice information are displayed in the first column of the second area, and the geographic location information corresponding to each first answer result is displayed in the second column, on the same row as that result.
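The choice between the two arrangements might be made as sketched below; the cut-off of four results and the three-column matrix width are assumptions matching the figures, not fixed by the disclosure.

```python
# Hypothetical layout selection for the second area of the voice response interface.
def choose_layout(results):
    if len(results) > 4:  # many results: matrix form (e.g. the 2 x 3 grid of fig. 4a)
        rows = [results[i:i + 3] for i in range(0, len(results), 3)]
        return {"mode": "matrix", "rows": rows}
    # Few results: two columns, each result next to its geographic info (fig. 4b).
    return {"mode": "two_column",
            "rows": [(r["name"], r["location"]) for r in results]}
```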
In the embodiment of the disclosure, the voice response interface may be displayed on a first display screen corresponding to the front seats or on a second display screen corresponding to the rear seats. Fig. 5a is a schematic diagram of a voice response interface displayed on the second display screen corresponding to the rear seats. The second display screen may display only the voice avatar and the surrounding keywords, which saves its display space, makes it convenient for rear passengers to issue subsequent voice instructions against the avatar and the surrounding keywords, and improves the rear passengers' interaction experience.
In a specific implementation, the voice response interface may be displayed on the first display screen, on the second display screen, or on both; when it is displayed on both, the specific display content may differ. In one embodiment, where the vehicle has multiple rows of seats, the display screen corresponding to the front row is generally larger, so a second area for displaying the voice answer result may be provided on the first display screen corresponding to the front seats, while only the voice avatar and the surrounding keywords are displayed on the second display screen corresponding to the rear seats.
Here, where the vehicle has multiple rows of seats, the rear seats generally correspond to the second display screen; since its size is usually smaller than that of the first display screen corresponding to the front seats, only the voice avatar and the surrounding keywords may be displayed on it. The second display screen may be located between the two rear seats or in the back of a front-seat backrest (in which case a second display screen may be provided in front of each rear seat).
Fig. 5b is a schematic diagram of a voice response interface displayed on the first display screen and the second display screen simultaneously. Since the first display screen corresponding to the front seats is generally larger, a second area for displaying the voice answer result can be set on it, while only the voice avatar and the surrounding keywords are displayed on the second display screen corresponding to the rear seats.
Fig. 5c is a schematic diagram of the relative positions of the first display screen and the second display screen within the vehicle; here, the first display screen is located at the front center of the front seats, and the second display screen is located between the two rear seats.
That is, displaying the first answer result corresponding to the first voice information in the second area of the voice response interface may apply to the case where the interface is displayed on the first display screen corresponding to a front seat in the vehicle.
In a specific implementation, after the surrounding keywords are displayed, the user may become interested in one of them; triggering the keyword of interest then switches the voice answer result.
Specifically, before the switch, the first voice information matches a first keyword among the plurality of keywords, and the first keyword is in a highlighted state. Then, second voice information is acquired; in response to the second voice information matching a second keyword among the plurality of keywords, the highlighted state of the first keyword is cancelled, the second keyword is placed in the highlighted state, and the second keyword is rotated to a position in front of the voice avatar for display. In addition, the displayed first answer result may be replaced with a second answer result matching the second keyword (this display effect may apply to the display screen in front of the front seats).
As shown in fig. 6, a schematic diagram of switching the keyword hit by voice: the first keyword hit by the first voice information is "coffee shop"; after the coffee-shop keyword and the related recommended keywords are displayed, the user inputs second voice information, which hits the recommended second keyword "closer distance". At this point the highlight on the first keyword is cancelled, the second keyword is placed in a highlighted state, and "closer distance" is rotated to the front of the voice avatar for display.
After the hit keyword switches, the answer results may be updated accordingly, by switching the results themselves or re-ranking them. For example, while the hit keyword is "coffee shop", the answer results may be ranked by a composite score over multiple screening dimensions; after the hit keyword switches to "closer distance", the results may be re-ranked by the target screening dimension (distance) corresponding to that keyword, i.e. ordered from the nearest driving distance to the farthest.
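The re-ranking behavior can be pictured as below; the field names, the composite-score formula, and the dimension labels are illustrative assumptions rather than the disclosure's prescribed scoring.

```python
# Sketch: rank answer results by a composite score, or by the hit screening dimension.
def rank_results(results, hit_dimension=None):
    if hit_dimension == "distance":   # e.g. after "closer distance" is hit
        return sorted(results, key=lambda r: r["distance_km"])           # near to far
    if hit_dimension == "rating":     # e.g. after "rated above 4 points" is hit
        return sorted(results, key=lambda r: r["rating"], reverse=True)
    # No screening keyword hit: composite score across dimensions (weights assumed).
    return sorted(results,
                  key=lambda r: r["rating"] - 0.1 * r["distance_km"],
                  reverse=True)
```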
In a specific implementation, if the voice information initiated by the user hits none of the recommended keywords, the user may be considered to have opened a new question direction; in this case the recommended keywords may be updated according to the understanding of the newly initiated voice information.
Specifically, third voice information is acquired; in response to the third voice information not matching the plurality of keywords, the voice response interface is updated, the updated voice response interface being used to display a plurality of updated keywords surrounding the voice avatar.
Similar to the above, the plurality of updated keywords may include a keyword characterizing the key semantics of the third voice information, together with other question keywords recommended based on the third voice information; these recommended keywords may be the same as or different from those previously recommended for the first voice information. For example, if the third voice information and the first voice information relate to similar kinds of places, the screening dimensions corresponding to the two may be the same, and the recommended keywords need not change. After the key semantics switch from "coffee shop" to "hot pot", some screening dimensions, such as distance and rating, may remain unchanged, while others may change, for example to taste evaluation, dish characteristics, and service evaluation.
In addition, the displayed answer results can be updated: the answer results displayed by the voice response interface are switched to those matching the keyword hit by the third voice information. This switching of displayed answer results may apply to the case where the display screen is the one for the front seats.
As shown in fig. 7, a schematic diagram of switching the plurality of keywords surrounding the voice avatar: for example, the third voice information is "let's have hot pot instead"; its key semantics "hot pot restaurant" is recognized, several keywords such as "distinctive dishes", "better service", and "better taste" are recommended according to these key semantics, and the keywords surrounding the voice avatar are switched to "hot pot restaurant", "distinctive dishes", "better service", and "better taste".
In a specific implementation, in the case where the third voice information is command-type voice information, an update prompt is broadcast by voice, the update prompt indicating the currently updated keywords.
In implementation, a voice instruction issued by the user may be command-type (stating a specific command requirement) or trend-type (content under discussion among users). For command-type voice information, the updated keywords can be broadcast by voice so as to respond to the user's command; for trend-type voice information, the corresponding update feedback may be given only on the interface, so as not to unduly disturb the discussion among users.
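A hedged sketch of this command-versus-trend branching follows; the imperative cue list is invented for illustration, and a real system would classify the utterance with a proper intent model.

```python
# Sketch: broadcast only for command-type utterances; trend-type updates stay on screen.
IMPERATIVE_CUES = ("find", "search", "navigate to", "switch to", "show")  # assumed cues

def respond_to_keyword_update(utterance: str, updated_keywords, broadcast, render):
    render(updated_keywords)  # the interface is updated in both cases
    if any(utterance.lower().startswith(cue) for cue in IMPERATIVE_CUES):
        broadcast("Updated suggestions: " + ", ".join(updated_keywords))
```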
In a specific implementation, when voice information further issued by the user hits any one of the plurality of answer results, service interface information corresponding to the hit answer result may be displayed. For example, if the further voice information hits answer result 1, which is a specific coffee shop, the display may automatically switch to the navigation page corresponding to answer result 1 and show the navigation route to the corresponding destination.
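A sketch of this hand-off to the service interface (the ui and nav calls below are illustrative placeholders, not interfaces defined by this disclosure):

    def on_answer_hit(answer, nav, ui):
        """Switch to the service interface for a hit answer result."""
        if answer.get("kind") == "poi":          # e.g. a specific coffee shop
            ui.open_page("navigation")           # auto-switch to the navigation page
            nav.route_to(answer["location"])     # show the route to the destination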
In implementation, after the user's voice information hits an answer result and the service interface is entered, the multi-user interaction scene may be exited automatically. In addition, there may be multiple other exit modes, such as active exit, timeout exit, and interrupted exit. For example, the user issues an instruction such as "exit group chat" or "exit voice", or no new voice information is received within a preset period of time, or the vehicle-mounted communication system receives an external incoming call; in these cases, the display of the voice avatar may be exited and the interface switched back to the one shown before the voice avatar was displayed.
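The exit conditions enumerated above could be checked as in the following sketch; the event names and the 30-second timeout are assumptions introduced for illustration:

    import time

    def should_exit(event, last_voice_ts, timeout_s=30.0):
        """Decide whether to leave the multi-user interaction scene."""
        if event in ("exit_group_chat", "exit_voice"):  # active exit
            return True
        if event == "incoming_call":                    # interrupted exit
            return True
        if event == "answer_hit_service_entered":       # automatic exit on service entry
            return True
        return time.monotonic() - last_voice_ts > timeout_s  # timeout exit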
It will be appreciated by those skilled in the art that, in the above methods of the specific embodiments, the written order of the steps does not imply a strict order of execution; the specific order of execution should be determined by the functions of the steps and their possible inherent logic.
Based on the same inventive concept, the embodiments of the present disclosure further provide a voice interaction device corresponding to the voice interaction method. Since the principle by which the device solves the problem is similar to that of the voice interaction method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated description is omitted.
Example two
Referring to fig. 8, a schematic diagram of a voice interaction device provided by an embodiment of the present disclosure is shown. The device includes: a first display module 801, an acquisition module 802, and a second display module 803, wherein:
the first display module 801 is configured to display a voice avatar, the voice avatar being used to indicate that the vehicle is currently in a voice function awake state;
the acquisition module 802 is configured to acquire first voice information and determine a plurality of keywords corresponding to the first voice information, where the keywords are used to guide a target user to initiate second voice information;
the second display module 803 is configured to display a voice response interface, the voice response interface being used to display a plurality of the keywords surrounding the voice avatar.
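For orientation only, the three-module structure of fig. 8 could be wired up as in the following skeleton; the method names are assumptions, since the disclosure only names the modules and their responsibilities:

    class VoiceInteractionDevice:
        """Skeleton of the three-module device of fig. 8."""

        def __init__(self, first_display, acquisition, second_display):
            self.first_display = first_display    # shows the voice avatar (801)
            self.acquisition = acquisition        # gets voice info, derives keywords (802)
            self.second_display = second_display  # shows the response interface (803)

        def run_once(self):
            self.first_display.show_avatar()  # indicates the voice wake-up state
            first_voice = self.acquisition.get_voice_info()
            keywords = self.acquisition.keywords_for(first_voice)  # guide next question
            self.second_display.show_response_interface(keywords)  # ring around avatar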
In one possible embodiment, the voice avatar is in a rotated state during the presentation, and the plurality of keywords are synchronously rotated around the voice avatar.
In one possible implementation, the second display module 803 is specifically configured to:
determining at least one surrounding path according to the number of the keywords, and displaying the keywords surrounding the voice avatar along the at least one surrounding path.
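One way to realize "at least one surrounding path" is to place the keywords on concentric rings, spilling extra keywords to an outer ring; the six-per-ring limit and the even angular spacing below are illustrative assumptions:

    import math

    def ring_positions(n_keywords, center, radius, per_ring=6):
        """Place keywords on one or more circular paths around the avatar."""
        cx, cy = center
        positions = []
        for i in range(n_keywords):
            ring = i // per_ring                  # extra keywords go to an outer path
            idx = i % per_ring
            count = min(n_keywords - ring * per_ring, per_ring)
            angle = 2 * math.pi * idx / count     # even spacing within each ring
            r = radius * (1 + 0.5 * ring)         # each outer path is wider
            positions.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
        return positions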
In one possible implementation, the second display module 803 is specifically configured to:
displaying a plurality of the keywords surrounding the voice avatar in a first region of the voice response interface;
and displaying a first answer result corresponding to the first voice information in a second area of the voice response interface.
In one possible implementation, the second display module 803 is specifically configured to:
and displaying a first answer result corresponding to the first voice information in a second area of the voice response interface under the condition that the voice response interface is displayed on a first display screen corresponding to a front seat in the vehicle.
In one possible implementation, the second display module 803 is specifically configured to:
displaying a plurality of first answer results corresponding to the first voice information in the second area according to a matrix form; or,
displaying a plurality of first answer results corresponding to the first voice information at a first column position of the second area, and displaying geographic position information corresponding to the first answer results at a second column position of the second area; the geographic location information is located in the same line as the first answer result.
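The two layouts described above might be rendered as in this sketch, with console printing standing in for the in-vehicle display; the three-column matrix width is an assumption:

    def layout_answers(results, mode="two_column"):
        """Render answer results as a matrix or as two aligned columns."""
        if mode == "matrix":
            for i in range(0, len(results), 3):   # 3 results per row (assumption)
                print("  ".join(r["name"] for r in results[i:i + 3]))
        else:
            # Name in the first column, geographic location in the second,
            # both on the same row.
            for r in results:
                print(f"{r['name']:<24}{r['location']}")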
In one possible implementation, the second display module 803 is specifically configured to:
before displaying the plurality of keywords surrounding the voice avatar, determining that the current scene is a multi-user interaction scene.
In one possible implementation, the first voice information includes integrated voice information determined based on voice information input by a plurality of users.
In one possible implementation, the first voice information matches a first keyword of the plurality of keywords, the first keyword being in a highlighted state;
the acquisition module 802 is further configured to: acquire second voice information;
the second display module 803 is further configured to: in response to the second voice information matching a second keyword of the plurality of keywords, canceling the highlighting state of the first keyword, placing the second keyword in the highlighting state, and rotating the second keyword to a front position of the voice avatar for display; and replacing the displayed first answer result with a second answer result matched with the second keyword.
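The highlight hand-off described here reduces to a few UI calls; the ui methods below are placeholders for the rendering layer, not interfaces defined by this disclosure:

    def on_second_keyword_hit(ui, first_kw, second_kw, second_answers):
        """Hand the highlight from the first hit keyword to the second."""
        ui.set_highlight(first_kw, False)   # cancel the first keyword's highlight
        ui.set_highlight(second_kw, True)   # highlight the newly hit keyword
        ui.rotate_to_front(second_kw)       # rotate it to the front of the avatar
        ui.show_answers(second_answers)     # replace the first answer results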
In one possible implementation, the acquisition module 802 is further configured to: acquire third voice information;
The second display module 803 is specifically configured to: and updating the voice response interface in response to the third voice information not matching the plurality of keywords, the updated voice response interface being for displaying a plurality of updated keywords surrounding the voice avatar.
In one possible embodiment, the apparatus further comprises:
the voice feedback module 804 is configured to update, after the second display module updates the voice response interface, update prompt information by voice broadcasting when the third voice information belongs to command type voice information, where the update prompt information is used to prompt the current updated keyword.
The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.
Corresponding to the voice interaction method in fig. 1, the embodiment of the present disclosure further provides a computer device 900. As shown in fig. 9, which is a schematic structural diagram of the computer device 900 provided by the embodiment of the present disclosure, the device includes:
a processor 91, a memory 92, and a bus 93. The memory 92 is configured to store execution instructions and includes an internal memory 921 and an external memory 922. The internal memory 921 is used to temporarily store operation data in the processor 91 and data exchanged with the external memory 922, such as a hard disk; the processor 91 exchanges data with the external memory 922 through the internal memory 921. When the computer device 900 runs, the processor 91 and the memory 92 communicate through the bus 93, so that the processor 91 executes the following instructions:
Displaying a voice avatar for indicating that the vehicle is currently in a voice function awake state;
acquiring first voice information, and determining a plurality of keywords corresponding to the first voice information, wherein the keywords are used for guiding a target user to initiate second voice information;
and displaying a voice response interface, wherein the voice response interface is used for displaying a plurality of keywords surrounding the voice avatar.
The disclosed embodiments also provide a computer readable storage medium having a computer program stored thereon, which when executed by a processor performs the steps of the voice interaction method described in the above method embodiments, wherein the storage medium may be a volatile or non-volatile computer readable storage medium.
The embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to perform the steps of the voice interaction method described in the foregoing method embodiments; reference may be made to the foregoing method embodiments, which are not described in detail herein.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium; in another alternative embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK).
It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The device embodiments described above are merely illustrative; the division into modules is merely a logical functional division, and there may be other divisions in actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Further, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices, or modules, and may be electrical, mechanical, or in other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present disclosure may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present disclosure, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
Finally, it should be noted that the foregoing embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit its technical solutions, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person familiar with the art may, within the technical scope disclosed herein, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes, or make equivalent substitutions for some of the technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure, and shall all be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (14)

1. A method of voice interaction, comprising:
displaying a voice avatar for indicating that the vehicle is currently in a voice function awake state;
acquiring first voice information, and determining a plurality of keywords corresponding to the first voice information, wherein the keywords are used for guiding a target user to initiate second voice information;
And displaying a voice response interface, wherein the voice response interface is used for displaying a plurality of keywords surrounding the voice avatar.
2. The method of claim 1, wherein the voice avatar is in a rotated state during the presentation, and the plurality of keywords are rotated synchronously around the voice avatar.
3. The method of claim 1, wherein the displaying the plurality of keywords surrounding the voice avatar comprises:
and determining at least one surrounding path according to the number of the keywords, and displaying the keywords surrounding the voice avatar according to the at least one surrounding path.
4. The method of claim 1, wherein the presenting a voice response interface comprises:
displaying a plurality of the keywords surrounding the voice avatar in a first region of the voice response interface;
and displaying a first answer result corresponding to the first voice information in a second area of the voice response interface.
5. The method of claim 4, wherein displaying the first answer result corresponding to the first voice information in the second area of the voice response interface comprises:
And displaying a first answer result corresponding to the first voice information in a second area of the voice response interface under the condition that the voice response interface is displayed on a first display screen corresponding to a front seat in the vehicle.
6. The method of claim 4, wherein displaying the first answer result corresponding to the first voice information in the second area of the voice response interface comprises:
displaying a plurality of first answer results corresponding to the first voice information in the second area according to a matrix form; or,
displaying a plurality of first answer results corresponding to the first voice information at a first column position of the second area, and displaying geographic position information corresponding to the first answer results at a second column position of the second area; the geographic location information is located in the same line as the first answer result.
7. The method of claim 1, further comprising, prior to displaying the plurality of keywords surrounding the voice avatar:
and determining that the current scene is a multi-user interaction scene.
8. The method of claim 7, wherein the first voice information comprises integrated voice information determined based on a plurality of user-entered voice information.
9. The method of claim 4, wherein the first voice information matches a first keyword of the plurality of keywords, the first keyword being in a highlighted state; the method further comprises the steps of:
acquiring second voice information;
in response to the second voice information matching a second keyword of the plurality of keywords, canceling the highlighting state of the first keyword, placing the second keyword in the highlighting state, and rotating the second keyword to a front position of the voice avatar for display;
and replacing the displayed first answer result with a second answer result matched with the second keyword.
10. The method according to claim 1, wherein the method further comprises:
acquiring third voice information;
and updating the voice response interface in response to the third voice information not matching the plurality of keywords, the updated voice response interface being for displaying a plurality of updated keywords surrounding the voice avatar.
11. The method of claim 10, further comprising, after updating the voice response interface:
And under the condition that the third voice information belongs to command-type voice information, broadcasting update prompt information by voice, wherein the update prompt information is used to prompt the currently updated keywords.
12. A voice interaction device, comprising:
the first display module is used for displaying a voice virtual image, and the voice virtual image is used for indicating that the vehicle is in a voice function awakening state at present;
the system comprises an acquisition module, a first voice information acquisition module and a second voice information acquisition module, wherein the acquisition module is used for acquiring the first voice information, determining a plurality of keywords corresponding to the first voice information, and the keywords are used for guiding a target user to initiate second voice information;
and the second display module is used for displaying a voice response interface, and the voice response interface is used for displaying a plurality of keywords surrounding the voice avatar.
13. A computer device, comprising: a processor, a memory and a bus, said memory storing machine readable instructions executable by said processor, said processor and said memory communicating via the bus when the computer device is running, said machine readable instructions when executed by said processor performing the steps of the voice interaction method according to any of claims 1 to 11.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the voice interaction method according to any of claims 1 to 11.