CN112164401B - Voice interaction method, server and computer-readable storage medium - Google Patents

Voice interaction method, server and computer-readable storage medium

Info

Publication number
CN112164401B
CN112164401B (application CN202010986263.XA)
Authority
CN
China
Prior art keywords
scene
task
voice
voice information
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010986263.XA
Other languages
Chinese (zh)
Other versions
CN112164401A (en)
Inventor
赵耀
易晖
唐乾斌
翁志伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Xiaopeng Motors Technology Co Ltd
Original Assignee
Guangzhou Xiaopeng Motors Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Xiaopeng Motors Technology Co Ltd filed Critical Guangzhou Xiaopeng Motors Technology Co Ltd
Priority to CN202010986263.XA
Publication of CN112164401A
Application granted
Publication of CN112164401B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/30 Distributed recognition, e.g. in client-server systems, for mobile phones or network applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The application discloses a voice interaction method for a vehicle. The voice interaction method comprises the following steps: performing a scene task check on the received voice information of the current turn; updating the scene task data corresponding to the voice information of the previous turn according to the result of the scene task check; and determining, according to the updated scene task data, the dialogue action corresponding to the voice information of the current turn, so as to interact with the user. In this vehicle voice interaction method, the current turn of voice is checked during the voice interaction between the user and the vehicle, so that dialogue-state tracking of the current graphical user interface is performed in combination with the user's voice. The dialogue across multiple turns of interaction is therefore more continuous, and the user is guided to express the operation intention accurately and completely according to the changes of the scene task data of the graphical user interface, so the voice interaction is more intelligent and the user experience is better. The application also discloses a server and a storage medium.

Description

Voice interaction method, server and computer-readable storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a speech interaction method for a vehicle, a server, and a computer-readable storage medium.
Background
With the development of artificial intelligence technology, voice intelligent platforms or voice assistants can recognize a user's voice input and, under certain conditions, generate corresponding operation instructions. This greatly facilitates operating vehicle equipment such as the central control display screen of a vehicle, so such assistants are widely used. However, in the related art, the voice assistant cannot track the user's dialogue state and therefore cannot carry out a continuous dialogue with the user, and its intelligence is poor.
Disclosure of Invention
In view of the above, embodiments of the present application provide a voice interaction method for a vehicle, a server, and a computer-readable storage medium.
The application provides a voice interaction method of a vehicle, which comprises the following steps:
performing scene task verification on the received voice information of the current round;
updating scene task data corresponding to the voice information of the previous turn according to the scene task check result;
and determining the dialogue action corresponding to the voice information of the current turn according to the updated scene task data so as to interact with the user.
In some embodiments, the performing the scene task check on the received voice information of the current turn includes:
and checking the scene page task hit by the current round of voice information by using a preset paradigm of scene page operation.
In some embodiments, the scene task data includes a scene data tree, and updating the scene task data corresponding to the previous turn of voice information according to the result of the scene task check includes:
generating a scene data tree according to the vehicle graphical user interface information which corresponds to the current round voice information and the previous round voice information together;
and confirming the hit nodes in the scene data tree according to the scene task check result.
In some embodiments, the determining, according to the updated scene task data, a dialog action corresponding to the current turn of the voice information to interact with the user includes:
determining the number of root nodes of the scene data tree according to the hit nodes in the scene data tree so as to determine the number of executable tasks;
and determining the dialogue action corresponding to the current turn of voice information according to the executable task number.
In some embodiments, the determining a dialog action corresponding to the current turn of voice information according to the number of executable tasks includes:
and if the number of the executable tasks is 1, generating a dialogue action for executing the executable tasks.
In some embodiments, the determining a dialog action corresponding to the current turn of voice information according to the number of executable tasks includes:
and if the number of the executable tasks is not 1, generating a dialogue action for guiding or clarifying the executable tasks.
In some embodiments, the interaction method further comprises:
judging the turn times of the current turn voice information;
and if the round number does not reach the round number threshold value, storing the scene task data after updating.
In some embodiments, before performing the scene task check on the received voice information of the current round, the method further includes:
and loading scene task data corresponding to the voice information of the previous turn.
The application provides a server, the server includes:
the verification module is used for performing scene task verification on the received voice information of the current turn;
the updating module is used for updating scene task data corresponding to the voice information of the previous turn according to the scene task checking result;
and the determining module is used for determining the dialogue action corresponding to the voice information of the current turn according to the updated scene task data so as to interact with the user.
A non-transitory computer-readable storage medium containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the method of voice interaction of a vehicle.
With the vehicle voice interaction method, the server, and the computer-readable storage medium, the current turn of voice is checked during the voice interaction between the user and the vehicle, so that dialogue-state tracking of the current graphical user interface is performed in combination with the user's voice. The dialogue across multiple turns of interaction is therefore more continuous, and the user is guided to express the operation intention accurately and completely according to the changes of the scene task data of the graphical user interface, so the voice interaction is more intelligent and the user experience is better.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 2 is a block diagram of a server in accordance with certain embodiments of the present application.
FIG. 3 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 4 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 5 is a schematic diagram of a scene data tree in accordance with certain embodiments of the present application.
FIG. 6 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 7 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 8 is a schematic diagram of a scenario of a voice interaction method according to some embodiments of the present application.
FIG. 9 is a flow chart illustrating a method of voice interaction in some embodiments of the present application.
FIG. 10 is a schematic diagram of a scenario of a voice interaction method according to some embodiments of the present application.
FIG. 11 is a schematic diagram of a scenario of a voice interaction method according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
Referring to fig. 1, the present application provides a voice interaction method for a vehicle. The method comprises the following steps:
s10: performing scene task verification on the received voice information of the current round;
s20: updating scene task data corresponding to the voice information of the previous turn according to the scene task check result;
s30: and determining the dialogue action corresponding to the voice information of the current turn according to the updated scene task data so as to interact with the user.
The embodiment of the application provides a server. The server includes a communication element and a processor. The communication element is configured to receive the voice information of the current turn uploaded by the vehicle. The processor is configured to perform a scene task check on the received voice information, to update the scene task data corresponding to the voice information of the previous turn according to the result of the check, and to determine, according to the updated scene task data, the dialogue action corresponding to the voice information of the current turn so as to interact with the user.
Referring to fig. 2, an embodiment of the present application further provides a server 100, and the voice interaction method according to the embodiment of the present application may be implemented by the server 100 according to the embodiment of the present application.
Specifically, the server 100 includes a verification module 102, an update module 104, and a determination module 106. S10 may be implemented by the verification module 102, S20 by the update module 104, and S30 by the determination module 106. In other words, the verification module 102 is configured to perform a scene task check on the received voice information of the current turn; the update module 104 is configured to update the scene task data corresponding to the previous turn of voice information according to the result of the scene task check; and the determination module 106 is configured to determine, according to the updated scene task data, the dialogue action corresponding to the current turn of voice information so as to interact with the user.
In the vehicle voice interaction method and the server 100 of the embodiments of the application, during the voice interaction between a user and a vehicle, the scene task data corresponding to the voice information of the previous turn are updated according to the result of the scene task check, and the dialogue action corresponding to the voice information of the current turn is determined from the updated scene task data so as to interact with the user, so that the dialogue state is tracked from turn to turn.
Specifically, the vehicle includes a display area, an electroacoustic element, a communication element, a processor, and the like. The display area of the vehicle may include a dashboard screen, an on-board display screen, and a head-up display that can be realized on the vehicle windshield, among others. The on-board system running on the vehicle presents content to the user through a Graphical User Interface (GUI). The display area contains a number of UI elements, and different display areas may present the same or different UI elements. The UI elements may include card objects, application icons or interfaces, folder icons, multimedia file icons, controls for interactive operations, and the like.
The intelligent display area of the vehicle gives the user a convenient entrance for controlling and interacting with the vehicle. A voice assistant is added to the on-board operating system; by recognizing voice it can, under certain conditions, conveniently generate corresponding control instructions, which further facilitates the interaction between the user and the vehicle. However, the voice assistant's capability for voice interaction is limited: generally it can only interact with preset operation targets, and for an operation target that is not preset it cannot understand the user's real purpose, so the user can only operate that target manually in the graphical user interface rather than by voice. Moreover, the voice interaction function is rudimentary. A continuous conversation with the user cannot be realized; the interaction often ends after a single turn of dialogue, regardless of whether the user's real intention has actually been understood, or else the successive turns of conversation are unrelated and independent of one another.
In this embodiment, after the user wakes up the voice assistant and inputs voice information, the graphical user interface information of the on-board system, or of the application currently running in the vehicle system, is acquired together with the voice information. The graphical user interface information includes the layout information of the elements in the current graphical user interface, such as the controls contained in the current graphical user interface, the type and position of each control, and the association relationships between different controls.
Furthermore, the vehicle sends the locally acquired voice input information and graphical user interface information to the server in the cloud. From the graphical user interface information, the server can determine the graphical user interface the user is interacting with and the relevant content in that interface. It therefore uses the graphical user interface information as auxiliary information when parsing the voice input information during semantic understanding, generates the corresponding scene task, and transmits it back to the vehicle, which then executes the corresponding operation according to the operation instruction.
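For illustration only, the uploaded data per turn might resemble the following Python sketch; the field names and structure are hypothetical assumptions for this description, not a protocol defined by the application.

    # Hypothetical sketch of the per-turn upload; every field name here is
    # an illustrative assumption, not part of the claimed method.
    payload = {
        "voice_text": "open the window 20% of the left front door",
        "gui_info": {
            "page": "window_adjustment",
            "controls": [
                {
                    "id": "btn_front_left_window",      # control identifier
                    "type": "button",                   # control type
                    "label": "left front door window",  # text description
                    "operations": ["click", "check"],   # supported operation modes
                    "position": [120, 340],             # position in the interface
                    "parent": "view_windows",           # layout relationship
                },
                # ... one entry per control in the current interface
            ],
        },
    }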
It is to be understood that the graphical user interface information lets the server perform semantic understanding with explicit knowledge of what the current graphical user interface of the vehicle contains, as seen by the user. During voice interaction the user is, in effect, interacting with the graphical user interface: the interactions that can be carried out in the interface delimit the target objects during semantic understanding, so that what would originally be a manual interaction between the user and the graphical user interface is realized by voice.
After the server performs semantic understanding of the received voice information of the current turn in combination with the graphical user interface, it performs a scene task check on the scene page operation hit by the semantic understanding, so as to determine whether the hit control can be executed, guided, or clarified.
If a historical conversation exists before the current turn, the historical conversation information, i.e., the scene task data corresponding to the previous turn, is loaded, and the scene task data are updated according to the result of the scene task check, so that the dialogue action corresponding to the voice information of the current turn is determined from the updated scene task data.
A dialogue action is then performed. The actions triggered by the current scene task state of the dialogue include execution, guidance, clarification, confirmation, cancellation, termination, and the like. For example, if the dialogue action is execution, an execution command is issued; if the dialogue action is guidance, clarification, or confirmation, a guiding, clarifying, or confirming voice prompt is generated and played.
In actual operation, the server has a task tracker in which the historical scene task data are stored. When the semantics of the voice information are clear, the hit control and operation may be unique after scene semantic understanding, and the generated dialogue action is to perform the execution operation on the hit control; the execution operation includes the hit control, the control operation, the operation value, and other contents.
If the semantics of the voice information are not clear enough after scene semantic understanding, for example several controls are hit at the same time, or only an action is hit without any control, then a definite dialogue action cannot be generated, and corresponding guidance or clarification wording needs to be generated and played through the electroacoustic element to form a dialogue with the user. That is, if the dialogue action is guidance, clarification, or confirmation, feedback information with guiding, clarifying, or confirming wording is generated and played.
In one example, the current graphical user interface is a window adjustment interface. A user interacting with the vehicle by voice wishes to control the window opening of the left front door and issues the voice request "open the window 20% of the left front door". After scene semantic understanding and the check, the executable conditions of the click operation, namely a control and an operation, are satisfied, and the unique hit control, the button for the left front door window, is confirmed. An execution instruction to open the left front window by 20% can then be generated and issued to the vehicle, together with feedback information expressing confirmation, such as "OK, opening the left front window for you", which the vehicle then carries out.
If the user says "open the window", then according to scene semantic understanding the hit controls in the graphical interface may include several controls such as the left front door window button, left rear door window button, right front door window button, right rear door window button, and skylight button. In this case, since the hit control is not unique, an execution task cannot be generated directly. Feedback information guiding the user to express clearer semantics, such as "may I ask which window to open", can instead be generated and broadcast by the vehicle, and the user can input subsequent voice information according to this feedback, until the upper limit on the number of conversation turns is reached or the server pins down the semantics.
Referring to fig. 3, in some embodiments, S10 includes:
s11: and checking the scene page task hit by the current round of voice information by using a preset paradigm of scene page operation.
In some embodiments, S11 may be implemented by the verification module 102. In other words, the verification module 102 is configured to check the scene page task hit by the current turn of voice information using a preset paradigm of scene page operations.
In some embodiments, the processor is configured to check the scene page task hit by the current turn of voice information using a preset paradigm of scene page operations.
Specifically, a paradigm check is performed on the scene page operation hit by the scene semantic understanding result, and it can thereby be determined whether the hit control meets its execution conditions. For example, if the scene semantic understanding result hits a button and its click operation, the execution conditions of the click operation are a control and an operation, so the button meets the check conditions and can be executed. During processing, the control results output by the semantic understanding are checked in sequence to determine whether each control can be executed, guided, or clarified. The scene task data are then updated according to the result of the executability check, providing the basis for confirming the dialogue action. For example, if two executable controls are confirmed after the scene task check, the dialogue action may be updated to clarification or guidance; if the scene task check confirms one executable control, the dialogue action may be updated to execution.
The scene task check on a control in the graphical user interface establishes an association between the operation of the control and its execution or guidance/clarification conditions. In this way, the operation type and relevant conditions of the control hit in scene semantic understanding can be checked to judge whether the control can be executed, guided, or clarified.
For example, for operations such as clicking, selecting, opening, and closing, the corresponding executable conditions are a control and an operation. The executable conditions for the sliding operation are a control, an operation, and a sliding value. The executable conditions for the single-select and multi-select operations are a control, an operation, and a selection index. The executable conditions for the text input operation are a control, an operation, and text. The executable conditions for the directional sliding operation are a control, an operation, a direction, and a moving position.
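As a minimal sketch of such a paradigm (in Python, with hypothetical names), the executable conditions can be stored as required slots per operation, and each hit checked against them:

    # Sketch of a preset paradigm of scene page operations. The operations
    # and their required slots follow the text above; the dictionary and
    # function names are hypothetical.
    EXECUTABLE_CONDITIONS = {
        "click":             {"control", "operation"},
        "select":            {"control", "operation"},
        "open":              {"control", "operation"},
        "close":             {"control", "operation"},
        "slide":             {"control", "operation", "slide_value"},
        "single_select":     {"control", "operation", "option_index"},
        "multi_select":      {"control", "operation", "option_index"},
        "text_input":        {"control", "operation", "text"},
        "directional_slide": {"control", "operation", "direction", "moving_position"},
    }

    def check_scene_task(hit):
        """Return True if every slot required by the hit operation is filled."""
        required = EXECUTABLE_CONDITIONS.get(hit.get("operation"))
        if required is None:
            return False
        return all(hit.get(slot) not in (None, "") for slot in required)

    # A hit on a button and its click operation satisfies {control, operation}.
    print(check_scene_task({"control": "btn_front_left_window",
                            "operation": "click"}))  # True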
In one example, the user utters the voice message "open the window". After scene semantic understanding, five buttons are hit: "open left front door window", "open left rear door window", "open right front door window", "open right rear door window", and "open skylight", together with their click operations.
After the check, it follows from the above that the execution conditions of the click operation, a control and an operation, are met, i.e., all five buttons included in the scene task satisfy the conditions and are executable tasks. Further, in combination with the scene task data of the previous turn, such as contextual information, a dialogue action can be determined.
Referring to FIG. 4, in some embodiments, the scene task data includes a scene data tree. S20 includes:
s21: generating a scene data tree according to control attribute information of a vehicle graphical user interface which corresponds to the current round voice information and the previous round voice information together;
s22: and confirming the hit nodes in the scene data tree according to the scene task check result.
In certain embodiments, S21 and S22 may be implemented by the update module 104. In other words, the update module 104 is configured to generate a scene data tree according to the control attribute information of the vehicle graphical user interface to which the current turn and the previous turn of voice information jointly correspond, and to confirm the hit nodes in the scene data tree according to the result of the scene task check.
In some embodiments, the processor is configured to generate a scene data tree according to control attribute information of a vehicle graphical user interface to which current round voice information and previous round voice information jointly correspond, and is configured to confirm a hit node in the scene data tree according to a result of the scene task check.
Referring to fig. 5, the graphical user interface information specifically takes each control in the graphical user interface as a unit, and includes the control identifier, control type, text description, operation modes supported by the control, operation parameters, the positions of the controls in the graphical user interface, and the layout relationships among the controls.
The control identifier identifies each element in the current graphical user interface; each element has a unique identifier. The elements are the content presented in the current graphical user interface. Taking an information point card interface as an example, the elements include the information point name, address, favorites, search surroundings, navigation route, and the like.
The text description is the way the element is expressed in the graphical user interface; for example, for a favorites control, the text description is "favorites".
The control type is the way the element is presented in the graphical user interface, such as a button, slider, state button, text input box, check box, radio button, group button, toggle button, view, group, or dialog box.
The operation modes supported by a control are the operations that a control of the corresponding type can carry out:

  • button: click, check
  • slider: slide, check
  • state button: click, slide, check, single-select, multi-select
  • text input box: click, check, input
  • check box: click, multi-select, check
  • radio button: click, single-select, check
  • group button: click, directional slide, check
  • toggle button: click, open, close, check
  • view: click, directional slide, single-select, check
  • group: click, check
  • dialog box: click, directional slide
The operation parameter reflects the degree of the operation mode; for example, the operation parameters corresponding to a click are short-press and long-press, and the operation parameters corresponding to a directional slide are large, medium, small, and so on.
Further, the graphical user interface may contain one or more applications running simultaneously, so there may be many controls to organize. The controls can be organized in the form of a tree, i.e., a scene data tree, in which each node represents one control and the node attributes include the control identifier, control type, control label, and the like. In this way, the control layout information of the current graphical interface can be represented by the scene data tree.
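As an illustrative sketch, a node of such a scene data tree could be represented as follows; the class and field names are assumptions made for this description.

    # Hypothetical scene data tree node. "hit" marks a control confirmed
    # executable by the scene task check; "executable_count" is accumulated
    # bottom-up (see the sketch further below).
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class SceneNode:
        control_id: str
        control_type: str
        label: str
        children: List["SceneNode"] = field(default_factory=list)
        hit: bool = False
        executable_count: int = 0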
A scene data tree is generated according to the vehicle graphical user interface information to which the current turn and the previous turn of voice information jointly correspond, fusing temporal information (the current and previous turns of voice information) with spatial information (the scene data) at the data level.
For example, in the first turn of dialogue the user says "open the window". A scene data tree corresponding to the scene page is constructed from the current window-control page, and semantic understanding plus the scene task check give 5 executable tasks. The tree structures from the controls corresponding to these 5 executable tasks up to the root node are retained, the other nodes are masked, and the update of the scene data tree is complete. The updated scene data are stored in the task tracker to be inherited by the next turn of dialogue.
In the second turn of dialogue, the scene data tree of the previous turn is inherited, and the state of the tree is then further updated according to the semantic understanding of the second turn's voice information and the result of the scene task check. That is, the update of the scene data tree in each turn is based on the scene data tree retained after screening by the previous turn of voice information, according to the semantic understanding and check result of the current turn's voice information.
Referring to fig. 6, in some embodiments, S30 includes:
s31: determining the number of root nodes of the scene data tree according to the hit nodes in the scene data tree so as to determine the number of executable tasks;
s32: and determining the dialogue action corresponding to the current turn of voice information according to the number of the executable tasks.
In certain embodiments, S31 and S32 may be implemented by determination module 106. In other words, the determining module 106 is configured to determine the number of root nodes of the scene data tree according to the nodes hit in the scene data tree to determine the number of executable tasks, and determine the dialog action corresponding to the current turn of the voice information according to the number of executable tasks.
In some embodiments, the processor is configured to determine a number of root nodes of the scene data tree from the nodes hit in the scene data tree to determine a number of executable tasks, and to determine a dialog action corresponding to the current turn of the voice information from the number of executable tasks.
It is to be understood that after the executable tasks are determined from the nodes hit in the scene data, it is necessary to further determine whether an executable task can be executed directly, i.e., to determine the dialogue action for the current turn of voice information. In this embodiment, the number of executable tasks is determined from the root node of the scene data tree, and the dialogue action corresponding to the current turn of voice information is then determined from that number.
The scene data tree represents the organization structure of the current graphical user interface; the root node is the view of the current graphical user interface, and the executable-task count of the root node represents the total number of executable tasks in the current graphical interface. During execution, the executable-task counts of the nodes corresponding to the executable tasks hit in the scene task check are recorded, and the counts of each node are then accumulated from the bottom up until the root node is reached; that is, the executable-task count of the current node is the sum of the executable-task counts of its child nodes.
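Building on the hypothetical SceneNode sketch above, the bottom-up accumulation can be written as a simple recursion:

    # Sketch of the bottom-up count: a node's executable-task count is its
    # own hit (1 or 0) plus the counts of its children, so the value at the
    # root is the total for the current interface.
    def count_executable(node):
        node.executable_count = int(node.hit) + sum(
            count_executable(child) for child in node.children
        )
        return node.executable_count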
For example, during interaction the user issues the voice information "open the window", and after the scene task check five executable tasks are hit: "open left front door window", "open left rear door window", "open right front door window", "open right rear door window", and "open skylight".
The state of the scene data tree is updated accordingly: the scene data tree contains 5 executable controls, so the number of executable tasks is 5.
Further, the dialogue state during the interaction, or the task parameters of the executable tasks, are recorded at the same time. The dialogue state record includes the initial parameters of the interaction and the parameters arising during the interaction. The initial parameters include the initially set total number of voice interaction turns; understandably, setting a total number of turns effectively prevents the voice interaction from entering an endless loop because accurate semantics can never be acquired. The parameters during the interaction may include the turn number of the current dialogue, the number of executable tasks, and so on.
For example, if the total number of voice interaction turns is set to 3, then if the current turn is the third turn of dialogue and an execution command still cannot be generated, the voice interaction with the user is ended; if the current turn is the first or second turn and an execution command cannot be generated, guidance or clarification can be performed according to the semantic understanding.
Referring to fig. 7, in some embodiments, S32 includes:
s321: and if the number of the executable tasks is 1, generating the dialogue action for executing the executable tasks.
In some embodiments, S321 may be implemented by the determination module 106. In other words, the determining module 106 is configured to generate a dialog action for executing the executable task if the number of executable tasks is 1.
In some embodiments, the processor is configured to generate a dialog action to execute the executable task if the number of executable tasks is 1.
Referring to fig. 8, specifically, when the executable-task count of the root node is 1, the executable task hit after semantic understanding and the scene task check is uniquely determined for the voice information of the current turn. In this case a dialogue action of execution can be generated; specifically, an execution instruction for the control is issued, including the control identifier, the operation mode of the control, and the operation parameters. At the same time, the text generation module generates the feedback information corresponding to the dialogue action according to the execution instruction.
For example, a user interacting with the vehicle by voice says "open the left front door window 20%". According to the semantic understanding of this voice, the hit scene task comprises a button and its click operation. After the scene task check, the executable conditions of the click operation, a control and an operation, are met by the button. By counting, the executable-task count of the root node in the scene data tree is 1, so the task can be executed directly and the dialogue action is determined to be execution. An execution instruction to open the left front window by 20% is generated and issued to the vehicle, together with feedback information expressing confirmation, such as "OK, opening the left front window for you" or "the left front window has been opened to 20%", which is issued to the vehicle and broadcast.
Referring to fig. 9, in some embodiments, S32 further includes:
s322: and if the number of the executable tasks is not 1, generating a dialogue action for guiding or clarifying the executable tasks.
In some embodiments, S322 may be implemented by determination module 106. In other words, the determining module 106 is configured to generate a dialog action to guide or clarify the executable task if the number of executable tasks is not 1.
In some embodiments, the processor is configured to generate a dialog action to direct or clarify the executable task if the number of executable tasks is not 1.
Referring to fig. 10, specifically, when the executable-task count of the root node is not 1, the semantic understanding has hit multiple scene tasks, all of which are executable after the scene task check, so it cannot be determined which of them should be executed. In this case the executable task cannot be executed immediately, and guidance or clarification is required.
Guidance means teaching the user with exemplary dialogue information, guiding the user to phrase the voice interaction in the style of that dialogue information, so that the voice input that follows has clearer semantics. Examples are feedback such as "you could say: open the left front door window" or "please give me the instruction again in the form: open the left front door window".
Clarification means asking the user, by way of a query, to explain and clarify information that was unclear in the earlier turn of conversation, so that the user's semantics become clear. For example: "may I ask which window to open", "how far would you like it opened", and so on.
For example, a user interacting with the vehicle by voice says "open the window". The scene tasks hit according to semantic understanding include five buttons, "open left front door window", "open left rear door window", "open right front door window", "open right rear door window", and "open skylight", together with their click operations. After the scene task check, the executable conditions of the click operation, a control and an operation, are met by all five buttons. Counting shows that the executable-task count of the root node in the data tree is 5, so the task cannot be executed directly and the dialogue action is determined to be clarification or guidance. Feedback information for guidance or clarification is generated, issued to the vehicle, and broadcast by the vehicle.
In this case, whether a guiding or a clarifying dialogue action is generated is determined according to the task parameters. For example, if the current dialogue is the first turn, the number of turns used is small and the upper limit of 3 turns is far from reached, so the dialogue action may be determined to be clarification. If instead the current dialogue is the 2nd turn and the executable task still cannot be uniquely determined from this turn's voice information, the user needs to be guided toward an accurate expression in the third turn, since otherwise the interaction may end after that turn; in this case the dialogue action may be determined to be guidance.
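The choice among execution, clarification, and guidance can be sketched as a small policy function; the exact policy below (clarify in early turns, guide on the last turn before the limit) is an assumption distilled from the examples above, with a 3-turn limit as in the example.

    # Hypothetical dialogue-action policy driven by the root's executable-task
    # count and the current turn number.
    def decide_dialog_action(executable_count, turn, max_turns=3):
        if executable_count == 1:
            return "execute"      # unique hit: issue the execution instruction
        if turn >= max_turns:
            return "terminate"    # turn threshold reached: end the interaction
        if turn == max_turns - 1:
            return "guide"        # last turn before the limit: teach the phrasing
        return "clarify"          # otherwise ask a clarifying question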
In some embodiments, the interaction method further comprises:
judging the turn number of the current turn voice information;
and if the number of turns does not reach the turn threshold value, storing the updated scene task data.
In some embodiments, the above steps may be implemented by the server 100; that is, the server 100 is configured to determine the turn number of the current turn of voice information and, if the turn number has not reached the turn-number threshold, to store the updated scene task data.
In some embodiments, the processor is configured to determine the turn number of the current turn of voice information and, if the turn number has not reached the turn-number threshold, to store the updated scene task data.
Specifically, during the interaction, multiple turns of guidance, clarification, confirmation, and the like may be needed because the voice information of a previous turn was insufficient or redundant, or because the user goes back on a historical conversation. Thus, over a multi-turn task, the range of information the user cares about gradually narrows from the entire graphical user interface down to the relevant controls mentioned above, i.e., to a part of the graphical user interface. The rest of the graphical user interface can therefore be masked, and this part used as the scene information input for scene semantic understanding of the current turn of voice information. That is, the scene data tree of the previous turn is loaded, and on that basis the tree is updated after the check according to the controls hit in the current turn's semantic understanding.
During the interaction, the task tracker is used to store and read the conversation context. If the task is not finished and cannot yet be executed, only the relevant nodes of the current scene state tree are saved. For example, if the executable count is larger than 1, all nodes with an executable count larger than 0 are saved; only the paths from the relevant controls to the root node are retained, and the scene matching understanding module can lower its matching threshold in the next turn, enabling a more natural conversation.
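Continuing the earlier SceneNode sketch, saving only the relevant nodes can be illustrated as pruning every subtree whose executable count is zero (a hypothetical sketch, assuming count_executable has already run):

    # Sketch of masking irrelevant nodes before saving: only paths from
    # executable controls up to the root survive into the next turn.
    def prune(node):
        if node.executable_count == 0:
            return None
        node.children = [c for c in node.children if c.executable_count > 0]
        for child in node.children:
            prune(child)
        return node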
If the current dialogue is the first turn, a scene data tree is constructed from the scene data and then updated according to the semantic understanding hits and the scene task check; the paths from the relevant executable controls to the root node are retained and stored in the task tracker.
If the current turn is an intermediate turn, the updated scene data tree of each turn is stored and then loaded in the next turn, limiting the scene range used for semantic understanding.
In some embodiments, the method of interacting further comprises, before S10:
and loading scene task data corresponding to the voice information of the previous turn.
In some embodiments, the above steps may be implemented by the server 100, that is, the server 100 is configured to load scene task data corresponding to the voice information of the previous turn.
In some embodiments, the processor is configured to load scene task data corresponding to the voice information of the previous turn.
Referring to fig. 11, specifically, in a subsequent turn of the conversation the scene task data tree of the previous turn is first loaded from the task tracker, and the scene data tree is updated according to the semantic understanding and check result of the current turn. It can be understood that after hit matching and similar operations, the scene data tree of the previous turn delimits a graphical user interface region carrying a small amount of information, so semantic understanding can lower its matching threshold, improving the accuracy of the multi-turn conversation.
For example, in the first turn of dialogue the user says "open a window". According to semantic analysis and the scene task check, the number of executable tasks is 5, so the tree structures from the controls corresponding to the 5 executable tasks up to the root node are retained while the other nodes are masked, completing the update of the scene data tree. At the same time, the feedback dialogue "may I ask which window to open" is issued.
In the next turn of dialogue, the user, prompted by the first turn's feedback, says "open the left front door window". The scene data tree of the first turn is taken from the task tracker as input, and semantic understanding on that basis determines that the scene task comprises the left front door window button and its click operation. Through the scene task check, the button meets the executable conditions; the tree is updated on the basis of the first turn's scene data tree, the executable-task count of the root node changes from 5 to 1, and the dialogue action of execution can be carried out directly. At the same time, feedback information expressing confirmation, such as "opening the left front door window for you", can be issued.
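Combining the hypothetical sketches above, this two-turn window example can be traced end to end:

    # Purely illustrative trace of the two-turn example, reusing the
    # SceneNode, count_executable, and decide_dialog_action sketches.
    root = SceneNode("view_windows", "view", "window controls", children=[
        SceneNode("btn_fl", "button", "open left front door window"),
        SceneNode("btn_rl", "button", "open left rear door window"),
        SceneNode("btn_fr", "button", "open right front door window"),
        SceneNode("btn_rr", "button", "open right rear door window"),
        SceneNode("btn_roof", "button", "open skylight"),
    ])

    # Turn 1: "open the window" hits all five buttons.
    for child in root.children:
        child.hit = True
    count_executable(root)                                 # root count is 5
    print(decide_dialog_action(root.executable_count, 1))  # -> clarify

    # Turn 2: "open the left front door window" narrows the hit to one button.
    for child in root.children:
        child.hit = (child.control_id == "btn_fl")
    count_executable(root)                                 # root count is 1
    print(decide_dialog_action(root.executable_count, 2))  # -> execute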
The embodiment of the application also provides a computer readable storage medium. One or more non-transitory computer-readable storage media containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the method of voice interaction of a vehicle of any of the embodiments described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), or the like.
The above examples express only several embodiments of the present application, and their description is comparatively specific and detailed, but they should not therefore be construed as limiting the scope of the application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A method of voice interaction for a vehicle, comprising:
performing scene task verification on the received voice information of the current round;
updating scene task data corresponding to the voice information of the previous turn according to the scene task check result; the scene task data comprises a scene data tree, the scene data tree is used for representing an organization structure of a current vehicle graphical user interface, a root node of the scene data tree is a view of the current vehicle graphical user interface, and the executable task quantity of the root node represents the total executable task quantity in the current vehicle graphical interface;
and determining the dialogue action corresponding to the voice information of the current turn according to the updated scene task data so as to interact with the user.
2. The voice interaction method of claim 1, wherein the performing scene task check on the received voice information of the current turn comprises:
and checking the scene page task hit by the current round of voice information by using a preset paradigm of scene page operation.
3. The voice interaction method according to claim 1, wherein the updating the scene task data corresponding to the voice information of the previous turn according to the result of the scene task check comprises:
generating a scene data tree according to the vehicle graphical user interface information which corresponds to the current round voice information and the previous round voice information together;
and confirming the hit nodes in the scene data tree according to the scene task check result.
4. The method of claim 3, wherein the determining, according to the updated scene task data, a dialog action corresponding to the current turn of the voice information to interact with the user comprises:
determining the number of root nodes of the scene data tree according to the hit nodes in the scene data tree so as to determine the number of executable tasks;
and determining the dialogue action corresponding to the current turn of voice information according to the executable task number.
5. The method of claim 4, wherein the determining the dialog action corresponding to the current turn of the voice message according to the number of executable tasks comprises:
and if the number of the executable tasks is 1, generating a dialogue action for executing the executable tasks.
6. The method of claim 4, wherein the determining the dialog action corresponding to the current turn of the voice message according to the number of executable tasks comprises:
and if the number of the executable tasks is not 1, generating a dialogue action for guiding or clarifying the executable tasks.
7. The voice interaction method of claim 6, wherein the interaction method further comprises:
judging the turn times of the current turn voice information;
and if the round number does not reach the round number threshold value, storing the scene task data after updating.
8. The voice interaction method according to claim 1, wherein before performing the scene task check on the received voice information of the current turn, the method further comprises:
and loading scene task data corresponding to the voice information of the previous turn.
9. A server, characterized in that the server comprises:
the verification module is used for performing scene task verification on the received voice information of the current turn;
the updating module is used for updating scene task data corresponding to the voice information of the previous turn according to the scene task checking result; the scene task data comprises a scene data tree, the scene data tree is used for representing an organization structure of a current vehicle graphical user interface, a root node of the scene data tree is a view of the current vehicle graphical user interface, and the executable task quantity of the root node represents the total executable task quantity in the current vehicle graphical interface;
and the determining module is used for determining the dialogue action corresponding to the voice information of the current turn according to the updated scene task data so as to interact with the user.
10. A non-transitory computer-readable storage medium containing computer-executable instructions that, when executed by one or more processors, cause the processors to perform the voice interaction method of a vehicle according to any one of claims 1 to 8.
CN202010986263.XA 2020-09-18 2020-09-18 Voice interaction method, server and computer-readable storage medium Active CN112164401B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010986263.XA CN112164401B (en) 2020-09-18 2020-09-18 Voice interaction method, server and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010986263.XA CN112164401B (en) 2020-09-18 2020-09-18 Voice interaction method, server and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN112164401A CN112164401A (en) 2021-01-01
CN112164401B (en) 2022-03-18

Family

ID=73858260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010986263.XA Active CN112164401B (en) 2020-09-18 2020-09-18 Voice interaction method, server and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN112164401B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421561B (en) * 2021-06-03 2024-01-09 广州小鹏汽车科技有限公司 Voice control method, voice control device, server, and storage medium
CN113900620B (en) * 2021-11-09 2024-05-03 杭州逗酷软件科技有限公司 Interaction method, device, electronic equipment and storage medium
CN113990299B (en) * 2021-12-24 2022-05-13 广州小鹏汽车科技有限公司 Voice interaction method and device, server and readable storage medium thereof
CN115376513B (en) * 2022-10-19 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN115512705A (en) * 2022-11-22 2022-12-23 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116016578B (en) * 2022-11-22 2024-04-16 中国第一汽车股份有限公司 Intelligent voice guiding method based on equipment state and user behavior
CN115565532B (en) * 2022-12-02 2023-05-12 广州小鹏汽车科技有限公司 Voice interaction method, server and computer readable storage medium
CN116564316B (en) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 Voice man-machine interaction method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701366A (en) * 2016-02-25 2018-10-23 高通股份有限公司 The start node of tree traversal for the shadow ray in graphics process determines
CN109616108A (en) * 2018-11-29 2019-04-12 北京羽扇智信息科技有限公司 More wheel dialogue interaction processing methods, device, electronic equipment and storage medium
CN109960537A (en) * 2019-03-29 2019-07-02 北京金山安全软件有限公司 Interaction method and device and electronic equipment
CN111002996A (en) * 2019-12-10 2020-04-14 广州小鹏汽车科技有限公司 Vehicle-mounted voice interaction method, server, vehicle and storage medium
CN111429895A (en) * 2018-12-21 2020-07-17 广东美的白色家电技术创新中心有限公司 Semantic understanding method and device for multi-round interaction and computer storage medium
CN111508482A (en) * 2019-01-11 2020-08-07 阿里巴巴集团控股有限公司 Semantic understanding and voice interaction method, device, equipment and storage medium
CN111639168A (en) * 2020-05-21 2020-09-08 北京百度网讯科技有限公司 Multi-turn conversation processing method and device, electronic equipment and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105488044A (en) * 2014-09-16 2016-04-13 华为技术有限公司 Data processing method and device
CN104360897B (en) * 2014-10-29 2017-09-22 百度在线网络技术(北京)有限公司 Dialog process method and dialog management system
US10298875B2 (en) * 2017-03-03 2019-05-21 Motorola Solutions, Inc. System, device, and method for evidentiary management of digital data associated with a localized Miranda-type process
CN109101537B (en) * 2018-06-27 2021-08-06 北京慧闻科技发展有限公司 Multi-turn dialogue data classification method and device based on deep learning and electronic equipment
CN111046150B (en) * 2018-10-15 2023-04-25 阿里巴巴集团控股有限公司 Man-machine interaction processing system and method, storage medium and electronic equipment
CN111401388B (en) * 2018-12-13 2023-06-30 北京嘀嘀无限科技发展有限公司 Data mining method, device, server and readable storage medium
CN109669754A (en) * 2018-12-25 2019-04-23 苏州思必驰信息科技有限公司 The dynamic display method of interactive voice window, voice interactive method and device with telescopic interactive window

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108701366A (en) * 2016-02-25 2018-10-23 高通股份有限公司 The start node of tree traversal for the shadow ray in graphics process determines
CN109616108A (en) * 2018-11-29 2019-04-12 北京羽扇智信息科技有限公司 More wheel dialogue interaction processing methods, device, electronic equipment and storage medium
CN111429895A (en) * 2018-12-21 2020-07-17 广东美的白色家电技术创新中心有限公司 Semantic understanding method and device for multi-round interaction and computer storage medium
CN111508482A (en) * 2019-01-11 2020-08-07 阿里巴巴集团控股有限公司 Semantic understanding and voice interaction method, device, equipment and storage medium
CN109960537A (en) * 2019-03-29 2019-07-02 北京金山安全软件有限公司 Interaction method and device and electronic equipment
CN111002996A (en) * 2019-12-10 2020-04-14 广州小鹏汽车科技有限公司 Vehicle-mounted voice interaction method, server, vehicle and storage medium
CN111639168A (en) * 2020-05-21 2020-09-08 北京百度网讯科技有限公司 Multi-turn conversation processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN112164401A (en) 2021-01-01

Similar Documents

Publication Publication Date Title
CN112164401B (en) Voice interaction method, server and computer-readable storage medium
CN108090177B (en) Multi-round question-answering system generation method, equipment, medium and multi-round question-answering system
WO2022057152A1 (en) Voice interaction method, server, and computer-readable storage medium
CN111639168B (en) Multi-round dialogue processing method and device, electronic equipment and storage medium
CN106547678B (en) Method and apparatus for white-box testing
US20190354594A1 (en) Building and deploying persona-based language generation models
CN112102832B (en) Speech recognition method, speech recognition device, server and computer-readable storage medium
CN111611368B (en) Method and device for backtracking public scene dialogue in multiple rounds of dialogue
CN110928409A (en) Vehicle-mounted scene mode control method and device, vehicle and storage medium
US20060224778A1 (en) Linked wizards
CN111768780A (en) Voice control method, information processing method, vehicle and server
CN113421561B (en) Voice control method, voice control device, server, and storage medium
CN111813900B (en) Multi-round dialogue processing method and device, electronic equipment and storage medium
WO2024099046A1 (en) Voice interaction method, server and computer-readable storage medium
CN112735407B (en) Dialogue processing method and device
CN111813912A (en) Man-machine conversation method, device, equipment and storage medium
CN115129878B (en) Conversation service execution method, device, storage medium and electronic equipment
CN107894882B (en) Voice input method of mobile terminal
CN111144132B (en) Semantic recognition method and device
CN112784024B (en) Man-machine conversation method, device, equipment and storage medium
CN110211576A (en) A kind of methods, devices and systems of speech recognition
CN113409785A (en) Vehicle-based voice interaction method and device, vehicle and storage medium
CN113987149A (en) Intelligent session method, system and storage medium for task robot
CN103701671A (en) Method and device for detecting conflicts among businesses
CN109960489B (en) Method, device, equipment, medium and question-answering system for generating intelligent question-answering system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant