WO2022057152A1 - Voice interaction method, server and computer-readable storage medium - Google Patents

Voice interaction method, server and computer-readable storage medium

Info

Publication number: WO2022057152A1
Authority: WO (WIPO/PCT)
Prior art keywords: task, result, understanding, voice request, information
Application number: PCT/CN2020/140940
Other languages: English (en), French (fr)
Inventors: 赵耀, 易晖, 翁志伟, 唐乾斌
Original Assignees: 广州橙行智动汽车科技有限公司, 广州小鹏汽车科技有限公司
Application filed by 广州橙行智动汽车科技有限公司 and 广州小鹏汽车科技有限公司
Publication of WO2022057152A1


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/18: Speech classification or search using natural language modelling
    • G10L 15/1822: Parsing for meaning understanding
    • G10L 15/183: Speech classification or search using natural language modelling, using context dependencies, e.g. language models
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command

Definitions

  • the present application relates to the technical field of speech recognition, and in particular, to a speech interaction method, server and computer-readable storage medium for vehicles.
  • The voice intelligent platform or voice assistant can recognize the user's voice input under certain conditions and generate corresponding operation instructions, which provides an extremely convenient tool for operating vehicle equipment such as the vehicle's central control display, and is therefore widely used.
  • However, the related art usually uses only the currently received speech information and lacks information of more dimensions, resulting in poor quality of human-computer interaction and a poor user experience.
  • embodiments of the present application provide a voice interaction method, a server and a computer-readable storage medium for a vehicle.
  • The present application provides a voice interaction method for a vehicle, comprising: performing fusion processing on the result of semantic understanding of a voice request; verifying the result of the fusion processing to update dialog state information and determine a task action corresponding to the voice request; and making a decision according to predetermined rules to determine the task corresponding to the task action.
  • The fusion processing of the result of semantic understanding of the voice request includes: performing global semantic understanding on the voice request to obtain a first understanding result, and performing scene semantic understanding on the voice request to obtain a second understanding result.
  • Performing global semantic understanding on the voice request to obtain the first understanding result includes: performing natural language understanding processing on the voice request information according to a pre-stored template to obtain the first understanding result.
  • Performing natural language understanding processing on the voice request information according to a pre-stored template includes: performing natural language understanding on the voice request based on a predetermined template to generate a first understanding sub-result, performing natural language understanding on the voice request based on a predetermined classification model to generate a second understanding sub-result, and fusing the first understanding sub-result and the second understanding sub-result to obtain the first understanding result.
  • Performing scene semantic understanding on the voice request to obtain the second understanding result includes: performing natural language understanding processing on the voice request information in combination with a graphical user interface to generate the second understanding result.
  • Performing natural language understanding processing on the voice request information in combination with the graphical user interface includes: performing natural language understanding on the voice request based on the graphical user interface to generate a third understanding sub-result, performing language understanding on the voice request based on knowledge reasoning to generate a fourth understanding sub-result, and fusing the third understanding sub-result and the fourth understanding sub-result to obtain the second understanding result.
  • Verifying the result of the fusion processing to update the dialog state information and determine the task action corresponding to the voice request includes: performing an executable check on the result of the semantic understanding, and updating the dialog state information according to the result of the executable check.
  • Updating the dialog state information according to the result of the executable check includes: updating the nodes hit in the task tree of the dialog state information according to the check result, determining the number of tasks corresponding to the voice request according to the number of nodes, and determining the task action corresponding to the voice request according to the number of tasks.
  • Making the decision according to predetermined rules to determine the task corresponding to the task action includes: performing layered processing on the input information used for semantic understanding to obtain first-layer voice request information and second-layer voice request information, matching the first-layer voice request information and the second-layer voice request information respectively according to a predetermined strategy to obtain the corresponding priorities, and fusing the priority of the first-layer voice request information and the priority of the second-layer voice request information to determine the task.
  • The layered processing of the input information includes: extracting feature information from the input information, dividing the feature information into the first-layer voice request information with the dialogue state information as an element, and dividing the feature information into the second-layer voice request information with the semantic understanding as an element.
  • Matching according to the predetermined strategy and obtaining the corresponding priorities includes: matching a first task for the first-layer voice request information from the predetermined strategy according to the dialogue state information and obtaining the priority of the first task, and matching a second task for the second-layer voice request information from the predetermined strategy according to the result information of semantic understanding and obtaining the priority of the second task.
  • Fusing the priorities to determine the task includes: comparing the scores corresponding to the priority of the first task and the priority of the second task according to the predetermined strategy, and determining the task with the higher score as the task of the task action.
  • This application provides a server, including: a semantic fusion module configured to perform fusion processing on the result of semantic understanding of the voice request; an action determination module configured to determine a task action corresponding to the voice request according to the result of the semantic understanding; and a task determination module configured to make a decision according to predetermined rules to determine the task corresponding to the task action.
  • the present application provides a server including a memory and a processor, wherein a computer program is stored in the memory, and the computer program implements the voice interaction method when executed by the processor.
  • the present application provides a non-volatile computer-readable storage medium containing computer-executable instructions that, when executed by one or more processors, implement the voice interaction method.
  • FIG. 1 is a schematic flowchart of a voice interaction method according to some embodiments of the present application.
  • FIG. 2 is a schematic block diagram of a server according to some embodiments of the present application.
  • FIGS. 3 to 7 are schematic flowcharts of voice interaction methods according to some embodiments of the present application.
  • FIG. 8 is an interactive schematic diagram of a voice interaction method according to some embodiments of the present application.
  • FIGS. 9 and 10 are schematic flowcharts of voice interaction methods according to some embodiments of the present application.
  • FIG. 11 is an interactive schematic diagram of a voice interaction method according to some embodiments of the present application.
  • the present application provides a voice interaction method for a vehicle.
  • The voice interaction method for a vehicle includes:
  • S10 Perform fusion processing on the result of semantic understanding of the voice request;
  • S20 Verify the result of the fusion processing to update the dialog state information, so as to determine the task action corresponding to the voice request;
  • S30 Make a decision according to a predetermined rule to determine the task corresponding to the task action.
  • the embodiments of the present application provide a server.
  • the server includes a communication element and a processor.
  • the communication element is used to receive voice requests uploaded by the vehicle.
  • The processor is configured to perform semantic understanding on the received voice request, to determine a task action corresponding to the voice request according to the result of the semantic understanding and the dialog state information, and to make decisions according to predetermined rules to determine the task corresponding to the task action.
  • the embodiment of the present application further provides a server 100, and the voice interaction method of the embodiment of the present application can be implemented by the server 100 of the embodiment of the present application.
  • the server 100 includes a semantic fusion module 102 , an action determination module 104 and a task determination module 106 .
  • S10 may be implemented by the semantic fusion module 102
  • S20 may be implemented by the action determination module 104
  • S30 may be implemented by the task determination module 106 .
  • the semantic fusion module 102 is configured to perform fusion processing on the result of semantic understanding of the speech request.
  • the action determination module 104 is configured to verify the result of the fusion processing to update the dialog state information to determine the task action corresponding to the voice request.
  • The task determination module 106 is configured to make a decision according to a predetermined rule to determine the task corresponding to the task action.
  • With the voice interaction method and server 100 for a vehicle, during the interaction between the user and the vehicle, the task and the execution action of the task are determined in combination with the dialogue history while the semantics are understood, so that multi-dimensional information is used to accurately understand the real intention of the user, and the intelligence and user experience of voice interaction are improved.
  • the vehicle includes a display area, an electro-acoustic element, a communication element, a processor, and the like.
  • the display area of the vehicle may include a dashboard screen, a vehicle-mounted display area screen, a head-up display that can be implemented on the windshield of the vehicle, and the like.
  • The onboard system running on the vehicle presents the displayed content to the user through a graphical user interface (GUI).
  • The display area includes many UI elements, and different display areas can display the same or different UI elements.
  • The UI elements may include card objects, application icons or interfaces, folder icons, multimedia file icons, and controls for interaction and operation. The display area of the vehicle provides the user with a convenient entrance for controlling and interacting with the vehicle, and adding a voice assistant to the on-board operating system makes it possible to generate corresponding control commands by recognizing voice under certain conditions, which further facilitates the user's interaction with the vehicle.
  • In use, the user controls the vehicle through the current graphical user interface and issues a voice control command, that is, sends a voice request. For example, relevant playback control instructions for multimedia playback are issued in the multimedia graphical user interface.
  • the graphical user interface information of the in-vehicle system or application program currently running by the vehicle system is uploaded to the scene information database of the server in real time.
  • the graphical user interface information includes layout information of elements in the current graphical user interface, such as the controls included in the current graphical user interface, the type and location of each control, and the association relationship between different controls.
  • natural language understanding can be performed according to the correlation in the scene database.
  • The GUI information is also the scene data information, which is organized around the controls in the GUI.
  • For each control, the information includes the control identifier, the control type, the text description, the operation modes supported by the control, the operation parameters, and related properties such as the position and layout relationship in the interface.
  • the control identifier can be used to identify each element in the current GUI, and each element has a unique identifier.
  • Elements are also the content presented in the current GUI. Taking the information point card interface as an example, the elements include information point names, addresses, favorites, search around, and navigation routes.
  • The text description is how the element is expressed in the GUI; for example, for a favorite control, its text description is "favorite".
  • the control type is also the element presentation type of the element in the GUI, such as button, slider, state button, text input box, check box, radio button, group button, switch button, view, group, dialog box etc.
  • The operation modes supported by a control are the operations that can be performed on the corresponding type of control.
  • For example, the operations supported by buttons include click and selection; the operations supported by sliders include slide and selection; the operations supported by status buttons include click, slide, selection, single selection and multiple selection; the operations supported by the text input box include clicking, selecting and entering text; the operations supported by the group button include click, azimuth swipe and selection; the operations supported by the view include click, azimuth swipe, single selection and selection; the operations supported by the switch button include click, open, close and selection; the operations supported by the group include clicks and selections; and the operations supported by dialog boxes include clicks and azimuth swipes.
  • The operation parameter corresponds to the degree of the operation mode. For example, the operation parameter corresponding to a click is a short press or a long press, and the operation parameter corresponding to azimuth sliding is large, medium or small.
  • the positions and layout relationships of multiple controls in the interface reflect the layout information of related elements in the graphical user interface, which is equivalent to providing visual information to the server so that the server can obtain the graphical user interface seen by the user.
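  • To make the structure of this scene data concrete, the following minimal sketch (not taken from the patent; the class ControlInfo and all field names are illustrative assumptions) shows how such a control descriptor could be represented:

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical control descriptor mirroring the scene data fields described
# above: identifier, type, text description, supported operations, operation
# parameters, and layout position. Field names are illustrative assumptions.
@dataclass
class ControlInfo:
    control_id: str                  # unique identifier of the element in the GUI
    control_type: str                # e.g. "button", "slider", "switch_button"
    text: str                        # how the element is expressed in the GUI
    operations: List[str]            # operations supported by this control type
    operation_params: List[str] = field(default_factory=list)  # e.g. "short_press"
    position: Optional[tuple] = None   # (x, y) layout position in the interface
    parent_id: Optional[str] = None    # layout relationship to other controls

# Example scene data for a point-of-interest card page, as in the example above.
scene_data = [
    ControlInfo("poi_favorite", "button", "favorite", ["click", "selection"],
                ["short_press", "long_press"], (120, 340), parent_id="poi_card"),
    ControlInfo("poi_navigate", "button", "navigation route", ["click", "selection"],
                parent_id="poi_card"),
]

if __name__ == "__main__":
    for ctrl in scene_data:
        print(ctrl.control_id, ctrl.operations)
```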
  • the semantic understanding of the voice request can integrate multi-channel information, including global semantics, scene semantics and so on.
  • The global semantics parses the voice request into structured domain-intent-slot fields without combining the GUI information for semantic understanding.
  • The scene semantics refers to semantic understanding of the voice request combined with the GUI information, such as understanding the serial number of the operated control, the execution action, the control value, and so on.
  • The semantic understanding of each channel includes multiple processing methods; for example, different models are used and the output semantic understanding results also differ.
  • The fusion processing in this application refers to fusing the results of global semantic understanding performed in different ways, and fusing the results of scene semantic understanding performed in different ways. Which fusion result of semantic understanding is finally used to determine the task requires further decision-making.
  • the server also has a dialogue state database, and the dialogue state database records historical data of multiple rounds of dialogues, for example, including the contextual content of the historical dialogue, the number of dialogue rounds, the task tree diagram and other information.
  • the data in the dialogue state database is updated during each round of dialogue.
  • the update basis may include, but is not limited to, the result of scene semantic understanding, the last moment or historical dialogue state information, and the like.
  • The task action of the current voice request can be determined according to the updated dialogue state information. The task action refers to the direction of the task corresponding to the voice request, such as clarifying, guiding, executing or failing, rather than a specific task.
  • The specific task is determined in the predetermined rules according to multiple inputs such as the dialogue state information and the scene data information. It is understandable that, as input information, the fusion results of different semantic understandings may correspond to different tasks; in this case, the relevant rules evaluate the priorities of the different tasks so as to determine the task currently requested by the voice, and the task is executed according to the task action.
  • the user sends a voice request to "turn up", according to semantic understanding, the fusion result of global semantic understanding hits the operation of raising up, and the fusion result of scene semantic understanding hits multiple controls that can be adjusted in height, etc. result.
  • the current is the first round of the dialogue, and the task action of the current round is determined as the guide.
  • Combining semantic understanding with scene data to determine the task is the guiding task.
  • the system can give feedback "what do you want to increase, try to say increase the temperature to me”.
  • the user sends a voice request to "turn up the brightness”.
  • the fusion result of scene semantic understanding hits two sliders with adjustable brightness.
  • the fusion result of global semantic understanding hits the object of the raised operation action and brightness.
  • the current dialogue is the second round, and the current round is determined.
  • the next task is to clarify. Combining semantic understanding with scene data to determine the task is a clarification task.
  • the Mission GUI provides a selection list of adjustable objects, the first being the instrument brightness and the second the onboard display area screen brightness. The system can feedback "find the brightness of the instrument and the brightness of the large screen, which one to increase”.
  • the user sends a voice request for "the first".
  • the fusion result of global semantic understanding hits serial number 1, and the fusion result of scene semantic understanding hits a meter, and the brightness can be increased.
  • the task of the current round is determined as execution, combined with semantic understanding and The task of determining the scene data is to perform the task of sliding the brightness slider of the instrument screen to the right, and the system feedback "the brightness of the instrument has been increased”.
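  • As an illustration only, the "turn up" example can be pictured with two fusion results of roughly the following shape, one from the global channel in domain-intent-slot form and one from the scene channel listing the hit controls; the patent does not prescribe a data format, so all keys and values here are assumptions:

```python
# Hypothetical shapes of the two fusion results for the first round of the
# "turn up" example; keys and values are illustrative assumptions.
global_fusion_result = {
    "domain": "vehicle_control",    # structured domain-intent-slot fields
    "intent": "increase",
    "slots": {"object": None},      # the object to increase is not yet known
}

scene_fusion_result = {
    # controls in the current GUI whose values can be raised
    "hit_controls": ["instrument_brightness_slider",
                     "screen_brightness_slider",
                     "temperature_slider",
                     "media_volume_slider"],
}

dialogue_round = 1  # first round of the dialogue

# Several adjustable controls are hit and it is an early round, so the task
# action is "guide" rather than direct execution, as in the example above.
if len(scene_fusion_result["hit_controls"]) == 1:
    task_action = "execute"
elif dialogue_round < 3:
    task_action = "guide" if global_fusion_result["slots"]["object"] is None else "clarify"
else:
    task_action = "fail"

print(task_action)  # -> "guide"
```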
  • S10 includes:
  • S11 Perform global semantic understanding on the voice request to obtain a first understanding result;
  • S12 Perform scene semantic understanding on the voice request to obtain a second understanding result.
  • S11 to S13 may be implemented by the semantic fusion module 102; in other words, the semantic fusion module 102 is configured to perform global semantic understanding on the voice request to obtain the first understanding result, and to perform scene semantic understanding on the voice request to obtain the second understanding result.
  • the processor is configured to perform global semantic understanding of the speech request to obtain a first understanding result, and to perform scene semantic understanding of the speech request to obtain a second understanding result.
  • the process of semantic understanding of the voice request is divided into two executions. Specifically, one of them is to perform natural language understanding processing on the voice request according to the pre-stored template without combining the GUI information. The other way is to perform natural language understanding processing on voice requests in combination with GUI information. Based on the above two strategies, language understanding processing is performed on the speech request respectively to obtain a first understanding result and a second understanding result.
  • natural language understanding of voice input information based on a pre-stored template can ensure the recall of generalization of input information to a certain extent, and combined with GUI information can ensure the consistency of voice interaction and graphical user interface to improve accuracy.
  • using different strategies for language understanding processing realizes the complementary advantages of different strategies, taking into account the recall rate and accuracy rate, so that the effect of speech recognition is better.
  • For example, the user sends a voice request related to opening the car window on the window-adjustment GUI. The voice request and the GUI information of the window control page are uploaded to the server; the server performs natural language understanding processing on the voice request according to the pre-stored template to obtain the first understanding result, and performs language understanding processing combined with the GUI information to obtain the second understanding result. If the voice request issued by the user is "open", multiple results may be recalled according to the predetermined template: the first understanding result hits the global operation "open" and recalls all switchable objects such as windows, doors and lights, while the second understanding result, combined with the current window-adjustment page, can return the result of opening the vehicle window.
  • S11 includes:
  • S111 Perform natural language understanding processing on the voice request information according to a pre-stored template to obtain a first understanding result.
  • S111 may be implemented by the semantic fusion module 102, or in other words, the semantic fusion module 102 is configured to perform natural language understanding processing on the voice request information according to a pre-stored template to obtain a first understanding result.
  • the processor is configured to perform natural language understanding processing on the voice request information according to a pre-stored template to obtain a first understanding result.
  • By performing natural language understanding on the voice input information based on the pre-stored template without combining the GUI information, simple semantic understanding can be performed for more generalized voice requests to a certain extent, especially for voice requests that cannot accurately hit GUI elements, so as to ensure the recall of voice requests.
  • S111 includes:
  • S1111 Perform natural language understanding on the voice request based on a predetermined template to generate a first understanding sub-result;
  • S1112 Perform natural language understanding on the voice request based on a predetermined classification model to generate a second understanding sub-result;
  • S1113 Fuse the first understanding sub-result and the second understanding sub-result to obtain the first understanding result.
  • S1111 - S1113 may be implemented by the semantic fusion module 102 .
  • the semantic fusion module 102 is configured to perform natural language understanding on the speech request based on a predetermined template to generate a first understanding sub-result, and is configured to perform natural language understanding on the speech request based on a predetermined classification model to generate a second understanding sub-result, and for fusing the first comprehension sub-result and the second comprehension sub-result to obtain the first comprehension result.
  • the processor is configured to perform natural language understanding of the speech request based on a predetermined template to generate a first understanding sub-result, and to perform natural language understanding of the speech request based on a predetermined classification model to generate a second understanding sub-result , and is used to fuse the first comprehension sub-result and the second comprehension sub-result to obtain the first comprehension result.
  • The language understanding processing of voice requests using pre-stored templates is divided into two groups. Understandably, the processing of different templates differs: some templates focus on the accuracy of the understanding results, and some focus on the recall rate of the understanding results, while different templates with the same focus are complementary across different business fields.
  • one of the groups focuses on the priority of precision, and the templates therein may include an AC automaton template, a syntax tree template, a regular expression template, and the like.
  • Another group focuses on recall priority, and the models can include BERT classification models, LSTM classification models, and GBDT classification models.
  • the natural language understanding processing is performed through the above-mentioned predetermined template, and the corresponding first understanding sub-results can be obtained respectively.
  • The voice request also goes through the above-mentioned predetermined classification models to generate the corresponding second understanding sub-results, and then the first understanding sub-results and the second understanding sub-results are passed through a corresponding fusion strategy to realize the fusion processing of the first and second understanding sub-results.
  • the convergence strategy includes general convergence strategy and custom convergence strategy.
  • the general convergence strategy applies to all services, and the custom convergence strategy sets specific strategies for some specific services.
  • the general fusion strategy adjusts the weight and priority of the corresponding comprehension sub-results according to the confidence of each comprehension sub-result, and then performs fusion processing on the weighted voting of each comprehension sub-result.
  • the priority of the relevant comprehension sub-results can be adjusted by considering whether the sentence template is hit and whether the context is from the same domain.
  • Alternatively, one of the multiple understanding sub-results can be directly selected as the final fusion result.
  • the custom fusion strategy supports hot updates, and the server maintainers can continuously adjust the fusion strategy and add new specific business scenarios through big data information that understands the natural language of the input information. Through this layering mechanism, it is ensured that the fusion of multiple understanding sub-results can be flexible enough, which is both versatile and can be adapted to the needs of special business scenarios.
  • For example, results such as playing music, navigating to Beijing and querying a location are obtained through the above-mentioned predetermined templates and classification models, with corresponding weights of 70%, 10% and 20%; after weighted voting, the fusion result is that the user's intention is to play music.
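  • A minimal sketch of such weighted voting over understanding sub-results might look like the following; the weights and candidate intents mirror the example above, while the function name and structure are assumptions rather than the patent's implementation:

```python
from collections import defaultdict

def weighted_vote(sub_results):
    """Fuse understanding sub-results by weighted voting.

    sub_results: list of (intent, weight) pairs, where the weight reflects the
    confidence / adjusted priority of the sub-result that produced the intent.
    Returns the intent with the highest accumulated weight.
    """
    scores = defaultdict(float)
    for intent, weight in sub_results:
        scores[intent] += weight
    return max(scores, key=scores.get)

# Sub-results from the predetermined templates and classification models,
# weighted 70% / 10% / 20% as in the example above.
sub_results = [
    ("play_music", 0.7),
    ("navigate_to_beijing", 0.1),
    ("query_location", 0.2),
]

print(weighted_vote(sub_results))  # -> "play_music"
```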
  • S12 includes:
  • S121 Perform natural language understanding processing on the voice request information in combination with a graphical user interface to generate a second understanding result.
  • S121 may be implemented by the semantic fusion module 102, or in other words, the semantic fusion module 102 is configured to perform natural language understanding processing on the voice request information in combination with a graphical user interface to generate a second understanding result.
  • the processor is configured to perform natural language understanding processing on the voice request information in conjunction with a graphical user interface to generate a second understanding result.
  • combining the GUI information can ensure the consistency of the voice interaction and the graphical user interface to improve the accuracy, and for the voice request with a high degree of matching with the GUI information, the user intent can be mapped according to the semantic understanding combined with the GUI information.
  • S121 includes:
  • S1211 Perform natural language understanding on the voice request based on the graphical user interface to generate a third understanding sub-result;
  • S1212 Perform language understanding on the voice request based on knowledge reasoning to generate a fourth understanding sub-result;
  • S1213 Fuse the third understanding sub-result and the fourth understanding sub-result to obtain the second understanding result.
  • S1211 - S1213 may be implemented by the semantic fusion module 102 .
  • the semantic fusion module 102 is configured to perform natural language understanding on the voice request based on the graphical user interface to generate a third understanding sub-result, and for performing language understanding on the voice request based on knowledge reasoning to generate a fourth understanding sub-result, and use The third comprehension sub-result and the fourth comprehension sub-result are fused to obtain the second comprehension result.
  • the processor is configured to perform natural language understanding of the speech request based on a graphical user interface to generate a third comprehension sub-result, and for performing linguistic understanding of the speech request based on knowledge reasoning to generate a fourth comprehension sub-result, and It is used to fuse the third comprehension sub-result and the fourth comprehension sub-result to obtain the second comprehension result.
  • The group focusing on accuracy priority may perform language understanding on the voice input information based on the GUI information, for example with processing methods including GUI control matching, exact matching, text matching, verb matching, fuzzy matching and pinyin matching. The group focusing on recall priority may perform language understanding on the voice request based on knowledge reasoning, for example inference based on action word collocation, inference based on entity synonyms, and reasoning based on abstract classification.
  • the natural language understanding processing is performed based on the GUI, and the corresponding third understanding sub-results can be obtained respectively.
  • The voice request also undergoes knowledge reasoning to generate the corresponding fourth understanding sub-results, and then the third understanding sub-results and the fourth understanding sub-results are passed through a corresponding fusion strategy to realize the fusion processing of the third and fourth understanding sub-results.
  • the convergence strategy includes general convergence strategy and custom convergence strategy.
  • the general convergence strategy applies to all services, and the custom convergence strategy sets specific strategies for some specific services.
  • The general fusion strategy can be based on the principle of giving priority to precision while taking recall into account, and use scoring, voting and other mechanisms for each understanding sub-result, such as majority voting, a weighted voting mechanism, a winner-tree mechanism, and fusion of machine learning models with related strategies such as Boosting and Bagging.
  • the fusion strategy may be to score the clarity of intent and completeness of the fields of the input information after natural language understanding processing, and adjust the priority of each understanding sub-result in the fusion voting process according to the scoring weight.
  • the priority of the relevant comprehension sub-results can be adjusted by considering the degree of collocation between the action word and the entity, and whether the key entity word exactly matches.
  • Alternatively, one of the multiple understanding sub-results can be directly selected as the final fusion result.
  • the custom fusion strategy supports hot updates, and the server maintainers can continuously adjust the fusion strategy and add new specific business scenarios through big data information that understands the natural language of the input information. Through this layering mechanism, it is ensured that the fusion of multiple understanding sub-results can be flexible enough, which is both versatile and can be adapted to the needs of special business scenarios.
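  • As a rough illustration of the scoring idea described above (scoring clarity of intent and completeness of the fields, and using the scores to rank sub-results), a sketch under assumed data shapes could be:

```python
# Illustrative sketch (not the patent's code) of the scoring idea described
# above: each understanding sub-result is scored on clarity of intent and
# completeness of its fields, and the scores decide which sub-result wins.
def score_sub_result(sub_result):
    """Score a sub-result on intent clarity plus slot completeness (0..2)."""
    clarity = 1.0 if sub_result.get("intent") else 0.0
    slots = sub_result.get("slots", {})
    completeness = (sum(1 for v in slots.values() if v is not None) / len(slots)
                    if slots else 0.0)
    return clarity + completeness

def fuse_by_score(sub_results):
    """Pick the sub-result with the highest clarity + completeness score."""
    return max(sub_results, key=score_sub_result)

sub_results = [
    {"intent": "adjust", "slots": {"object": "navigation_volume", "direction": "up"}},
    {"intent": "adjust", "slots": {"object": None, "direction": "up"}},
]

print(fuse_by_score(sub_results))  # the more complete sub-result wins
```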
  • For example, the current GUI includes a navigation volume control, and the operations supported by the control are turning the volume up and down. With different matching processing methods, voice requests with different expressions can finally be matched to the control and the related operations.
  • the voice request "turn up the navigation volume” can be matched to the control and action through the exact matching process.
  • the voice request "navigation volume” can be matched to this control through text matching processing.
  • the voice request to "make it bigger” can be processed by action word matching to match the operation of the control.
  • the voice request "navigation sound” can be matched to this control through fuzzy matching processing.
  • the voice request "dao hang volume” can be matched to this control through pinyin matching processing.
  • Action word collocation reasoning recalls matching results according to the degree of collocation with the relevant verbs in the voice request. For example, for the voice request "turn down", controls such as doors have a low degree of collocation with the verb and will not be recalled.
  • Entity synonym reasoning is to synonymously expand the entity words of the voice request, so that the voice request can be generalized, so that more results can be recalled. For example, the voice input information "main window” can be expanded to "left front window”.
  • Abstract categorization reasoning is to categorize the entity words in the voice request, so that the voice input information can be generalized, so that more results can be recalled. For example, the voice request "dipped beam” can be expanded to "car lights”.
  • processing methods in different groups are not limited to the methods disclosed in this application, and any natural language understanding processing methods that can achieve the required purpose can be added as required.
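  • The following toy sketch illustrates a few of the matching ideas listed above (exact matching, text matching and entity synonym expansion); the control table, synonym table and function are invented for illustration and are not the patent's method:

```python
# A toy sketch of the matching ideas listed above; all names and data are
# illustrative assumptions, not the patent's implementation.
CONTROLS = {
    "navigation_volume_slider": {
        "text": "navigation volume",
        "operations": ["turn up", "turn down"],
    },
    "left_front_window_button": {
        "text": "left front window",
        "operations": ["open", "close"],
    },
}

SYNONYMS = {"main window": "left front window"}  # entity synonym reasoning

def match_control(request: str):
    """Return (control_id, matched_operation) for a voice request, or None."""
    req = SYNONYMS.get(request, request)
    for cid, ctrl in CONTROLS.items():
        # exact matching: control text plus an operation word
        for op in ctrl["operations"]:
            if req == f"{op} the {ctrl['text']}" or req == f"{op} {ctrl['text']}":
                return cid, op
        # text matching: the request names the control but not the operation
        if ctrl["text"] in req or req in ctrl["text"]:
            return cid, None
    return None

print(match_control("turn up the navigation volume"))  # exact match
print(match_control("navigation volume"))              # text match
print(match_control("main window"))                    # synonym expansion
```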
  • S20 includes:
  • S21 Perform an executable check on the result of semantic understanding;
  • S22 Update the dialog state information according to the result of the executable check.
  • S21 and S22 may be implemented by the action determination module 104 .
  • the action determination module 104 can be used to perform executable verification on the result of semantic understanding, and to update the dialog state information according to the result of the executable verification.
  • the processor is configured to perform an executable check on the result of the semantic understanding, and to update the dialog state information according to the result of the executable check.
  • The semantic understanding result of the voice request is input into a task tracker, which inherits and stores the dialog state information from the dialog state database as a historical task tracker. The state of the task tracker is updated based on the input, thereby updating the dialog state information for the current voice request.
  • the data sources of the state update conditions include: the results of semantic understanding, and dialog state information, such as the current dialog script state, and the dialog strategy selection results in the context.
  • the dialogue script is a directed graph for certain tasks that specifically contain multiple processes.
  • a dialogue script can be thought of as a collection of actions contained in one large independent task.
  • a navigation task includes multiple sub-tasks such as searching for POIs, calculating routes, selecting routes, and navigating.
  • the voice request hits the navigation task, it is transferred to the dialogue script, and the subsequent dialogue process flows in the dialogue script of the navigation task, which has a stronger directionality.
  • The current dialogue script state refers to whether the current voice request is in the current script, in a new script, or in no script at all.
  • the results of the contextual dialogue strategy selection include whether the historical dialogue adopts global semantics or scene semantics.
  • The task action of the voice request can be determined based on the data sources of the status update; that is, the actions triggered by the current task state of the dialog include executing, guiding, clarifying, confirming, canceling, ending, and so on. For example, if the task action is execute, an execute command is issued; if the task action is guiding, clarifying or confirming, the corresponding guiding, clarifying or confirming utterance is generated and played.
  • During the interaction process, due to missing information in the request, redundancy, or the user going back on the historical dialogue, multiple rounds of interaction such as guidance, clarification and confirmation are required.
  • The information that the user pays attention to is often no longer the entire graphical interaction page but the related controls mentioned above, that is, part of the scene data. Therefore, by inheriting the dialogue state information, part of the page can be filtered as scene information and passed to the next round of semantic understanding.
  • A context manager can be used to store and read the task tracker, keeping only the relevant controls. For example, when loading the dialogue state information, if there are multiple rounds of task states, the executable and guide-clarification controls in the historical scene are compared, and only the duplicate controls hit by the current semantic understanding are kept.
  • Executable check is a paradigm check on the hit control element in the semantic understanding result, to determine whether the hit control element is executable.
  • the semantic understanding result hits the button and its click operation.
  • the execution conditions of the click operation are the control and the operation, and the button meets the test conditions and can be executed.
  • the returned results of the controls output by the semantic understanding result are checked in turn, so as to obtain whether the controls are executable and whether they can lead to clarification.
  • The dialog state information is updated according to the result of the executable check, so as to provide a basis for confirming the task action. For example, if the executable check confirms that there are two executable controls, the dialog action of the dialog state information may be updated to clarify or guide; if the executable check confirms that there is exactly one executable control, the dialog action may be updated to execute.
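  • A minimal sketch of such an executable check, assuming a simple dictionary representation of hits and scene controls (names are illustrative), could be:

```python
# Minimal sketch of the executable (paradigm) check described above: a hit is
# executable when both the control and a supported operation are present.
# Structure and names are assumptions for illustration only.
def executable_check(hit, scene_controls):
    """Check whether a semantic-understanding hit can be executed.

    hit: {"control_id": ..., "operation": ...}
    scene_controls: {control_id: {"operations": [...]}}
    """
    ctrl = scene_controls.get(hit.get("control_id"))
    if ctrl is None or not hit.get("operation"):
        return False                      # missing control or operation
    return hit["operation"] in ctrl["operations"]

def update_dialog_action(hits, scene_controls):
    """Update the dialog action from the number of executable hits."""
    executable = [h for h in hits if executable_check(h, scene_controls)]
    if len(executable) == 1:
        return "execute", executable
    if len(executable) > 1:
        return "clarify_or_guide", executable
    return "fail", executable

scene = {"sunroof_button": {"operations": ["click"]}}
hits = [{"control_id": "sunroof_button", "operation": "click"}]
print(update_dialog_action(hits, scene))  # -> ("execute", [...])
```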
  • S22 includes:
  • S221 Update the nodes hit in the task tree of the dialogue state information according to the check result;
  • S222 Determine the number of tasks corresponding to the voice request according to the number of nodes;
  • S223 Determine the task action corresponding to the voice request according to the number of tasks.
  • S221 - S223 may be implemented by the action determination module 104 .
  • The action determination module 104 can be used to update the nodes hit in the task tree of the dialogue state information according to the check result, to determine the number of tasks corresponding to the voice request according to the number of nodes, and to determine the task action corresponding to the voice request according to the number of tasks.
  • The processor is configured to update the nodes hit in the task tree of the dialogue state information according to the check result, to determine the number of tasks corresponding to the voice request according to the number of nodes, and to determine the task action corresponding to the voice request according to the number of tasks.
  • the task tree in the dialog state is a tree-like diagram used to represent the layout relationship of the controls in the scene.
  • the GUI may include one or more applications running simultaneously, and the one or more applications are composed of one or more controls. Therefore, there may be control layouts for multiple organizational structures.
  • the task tree builds these controls in the form of a tree-like diagram, and each node represents a control in the scene data.
  • the root node in the task tree is the view of the current GUI, and the number of executable tasks of the root node represents the number of executable controls in the current scene after the executable check.
  • For each hit node, the number of its executable tasks is counted, and then the executable numbers of the nodes are accumulated from the bottom to the top of the task tree until the root node is reached.
  • The task action is determined by judging the executable number. If the executable number is 1, the task action is execute; if the executable number is greater than 1, further judgment is made according to the dialog state information: if the current dialogue round is less than the round threshold, the task action may be clarifying or guiding, and if the current dialogue round is greater than or equal to the round threshold, the task action is failure.
  • the guidance refers to forming a teaching for the user with exemplary feedback information, and guiding the user to perform voice interaction in the style of the feedback information, so that a voice request with clearer semantics can be input. For example, “Do you want to open the left front door window, please?", "Please give me the instruction to open the left front door window again” and other feedback.
  • Clarification means that the user can explain and clarify the unclear request in the first round of dialogue in the follow-up dialogue, so that the user's semantics can be clarified. For example, “Which window do you want to open?", "What height do you want to open?” and other feedback.
  • For example, the user interacts with the vehicle through voice and sends a voice request "open the window". After semantic understanding, five buttons are hit: "open the left front door window", "open the left rear door window", "open the right front door window", "open the right rear door window" and "open the sunroof", together with their click operations.
  • The executable conditions of the click operation are the control and the operation, and all five buttons meet the executable conditions. By statistics, the number of executable tasks at the root node of the task tree is 5, which cannot be directly executed; therefore, feedback information for guidance or clarification is generated, sent to the vehicle and broadcast by the vehicle.
  • Whether the task action is guiding or clarifying is determined according to the specific situation of the dialogue state information. For example, if it is currently the first round of dialogue, the number of rounds is small and the upper limit of 3 rounds has not been reached, so the task action can be determined to be clarification. For another example, if the current dialogue is the second round and the executable task still cannot be uniquely determined from this round's voice request, the user needs to be guided to express accurately in the third round of dialogue, otherwise the interaction may end after the third round; at this point the task action can be determined to be guiding.
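  • The bottom-up counting over the task tree and the mapping from the executable count and dialogue round to a task action could be sketched as follows; the Node class, the round threshold of 3 and the clarify/guide split by round follow the examples above, but the code itself is only an assumed illustration:

```python
# Sketch of the task-tree counting described above: executable counts are
# accumulated bottom-up to the root, and the task action follows from the
# count and the dialogue round. Names and structure are illustrative only.
class Node:
    def __init__(self, control_id, executable=False, children=None):
        self.control_id = control_id
        self.executable = executable      # result of the executable check
        self.children = children or []

def count_executables(node):
    """Accumulate the executable count from the leaves up to this node."""
    return int(node.executable) + sum(count_executables(c) for c in node.children)

def task_action(root, dialogue_round, round_threshold=3):
    n = count_executables(root)
    if n == 1:
        return "execute"
    if n == 0 or dialogue_round >= round_threshold:
        return "fail"
    # more than one executable control: clarify in early rounds, guide later
    return "clarify" if dialogue_round == 1 else "guide"

# "Open the window" hits five openable window buttons in the current GUI view.
root = Node("current_gui_view", children=[
    Node(name, executable=True)
    for name in ["left_front_window", "left_rear_window", "right_front_window",
                 "right_rear_window", "sunroof"]
])
print(count_executables(root))              # -> 5
print(task_action(root, dialogue_round=1))  # -> "clarify"
```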
  • S30 includes:
  • S31 Perform hierarchical processing on the input information for fusion processing to obtain the first-layer voice request information and the second-layer voice request information;
  • S32 According to a predetermined strategy, respectively match the first-layer voice request information and the second-layer voice request information and obtain the corresponding priorities;
  • S33 Perform fusion processing on the priority of the voice request information of the first layer and the priority of the voice request information of the second layer to determine a task corresponding to the task action.
  • S31 - S33 may be implemented by the task determination module 106 .
  • The task determination module 106 is used to perform layered processing on the input information used for semantic understanding to obtain the first-layer voice request information and the second-layer voice request information, to respectively match the first-layer voice request information and the second-layer voice request information according to a predetermined strategy and obtain the corresponding priorities, and to fuse the priority of the first-layer voice request information and the priority of the second-layer voice request information to determine the task corresponding to the task action.
  • The processor is configured to perform layered processing on the input information used for semantic understanding to obtain the first-layer voice request information and the second-layer voice request information, to respectively match the first-layer voice request information and the second-layer voice request information according to a predetermined strategy and obtain the corresponding priorities, and to fuse the priorities to determine the task corresponding to the task action.
  • the task corresponding to the voice request is generated by integrating multiple information sources.
  • the multi-channel information sources include the fusion result of global semantic understanding, the fusion result of scene semantic understanding, scene data, context information, etc.
  • the task tracker and scene data containing the above-mentioned information sources are used as input data and processed through a predetermined strategy to determine the final task.
  • the multi-channel input information is processed in layers, the priorities of the execution strategies corresponding to the layered information are respectively determined, and the priorities of the execution strategies of each layer are merged to obtain the final task.
  • the complexity of task determination processing is reduced in a hierarchical manner, so that tasks in the process of multiple rounds of voice interaction have a clear generation and determination strategy.
  • the predetermined policy may be a policy list pre-stored in the server's memory, including a plurality of execution policies and corresponding priority scores. Different hierarchical information may match different execution strategies, and different execution strategies correspond to different priority scores. In the case of different priorities, fusion processing is performed to obtain the final execution strategy of the voice request.
  • the strategy of fusion processing may be a strategy of taking a high score.
  • it can also be comprehensively considered according to priority and weight.
  • S31 includes:
  • S311 Extract feature information from the input information;
  • S312 Divide the feature information into the first-layer voice request information with the dialogue state information as an element;
  • S313 Divide the feature information into the second-layer voice request information with the semantic understanding as an element.
  • S311 - S313 may be implemented by the task determination module 106 .
  • The task determination module 106 is used to extract feature information from the input information, to divide the feature information into the first-layer voice request information with the dialogue state information as an element, and to divide the feature information into the second-layer voice request information with the semantic understanding as an element.
  • The processor is configured to extract feature information from the input information, to divide the feature information into the first-layer voice request information with the dialogue state information as an element, and to divide the feature information into the second-layer voice request information with the semantic understanding as an element.
  • The feature information is the information obtained after the voice request has been processed by natural language understanding and the like, including but not limited to the domain name, intent and intent ID hit after global semantic understanding; the scene ID, element ID and attribute information of the scene obtained by combining the scene data; and the domain IDs of the multiple rounds of dialogue recorded in the dialogue state tracker, status information such as whether the dialogue is in a script, and so on.
  • Extracting feature information is the preprocessing process before strategy matching, converting the above feature information into attribute labels that can be accepted by the strategy matching engine, or the judgment conditions for task matching.
  • Each layer may have multiple attribute values. Assemble multiple attribute values into an object, that is, the voice request information corresponding to this layer.
  • The preprocessing process includes judging the hit scene elements through the scene data and scene semantic understanding and transforming them into attribute labels that can be used for layering, transforming global semantic understanding into scene semantic understanding and then into attribute labels that can be used for layering, and obtaining related information from the task tracker.
  • The transformation of global semantic understanding into scene semantic understanding occurs when no scene semantic understanding is recalled and the global semantics can be matched to scene semantics in a predefined mapping library; in this way, the processing of the business can be effectively supplemented with scene information.
  • the predetermined layering rule is to divide the above assembled objects according to certain rules. In other words, the combined object is divided into the first layer and the second layer according to the policy matching rules.
  • The first layer and the second layer do not have a relationship of inclusion or progression; they can be regarded as two parallel objects performing task matching in different ways.
  • The task tracker information includes status information such as whether a scene task is opened, the number of rounds of the current dialogue, whether the dialogue is in a script, whether there is global semantics, and whether there is scene semantics.
  • feature information is extracted, and part of the feature information is combined into the first-level voice request information according to the task tracker information, so that the task strategy can be matched with the dialogue state information.
  • the semantic understanding information includes the fusion result of scene semantic understanding, the fusion result of global semantic understanding, and scene data information.
  • The second-layer voice request information is obtained by combining part of the feature information with the semantic understanding information as an element, so that the task strategy can be matched through the semantic understanding information.
  • the voice request "turn up the volume” the feature information may include: after global semantic understanding, the operation is adjustment, and the object is volume; after the scene semantic understanding, hit the navigation volume slider, the scene data has The navigation map page has volume up related controls.
  • the information in the dialogue state tracker includes that the dialogue is the Nth round, not in the script, and the scene task hits multiple volume up controls.
  • the scene needs to be clarified.
  • the feature information is divided into the first layer of voice request information, including the dialogue as the Nth round. In the script, it needs to be clarified to increase the number of hits in multiple scenes.
  • the second-level voice request information includes that the scene page is a navigation, and the scene semantics hits the navigation volume slider.
  • the strategy for matching the voice request information of the first layer is "scene clarification"
  • the strategy for matching the voice request information of the second layer is "scene priority".
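  • The layering of the extracted feature information for this "turn up the volume" example could be sketched as follows, with all field names assumed for illustration:

```python
# Illustrative sketch of the layering step for the "turn up the volume"
# example above: the extracted feature information is split into a first-layer
# object keyed on dialogue state and a second-layer object keyed on semantic
# understanding. All field names are assumptions.
features = {
    # from global semantic understanding
    "global_operation": "adjust", "global_object": "volume",
    # from scene semantic understanding and scene data
    "scene_page": "navigation_map", "scene_hits": ["navigation_volume_slider"],
    "scene_volume_controls": 2,
    # from the dialogue state tracker
    "round": 2, "in_script": False,
}

def layer_request(features):
    first_layer = {                      # dialogue state information as element
        "round": features["round"],
        "in_script": features["in_script"],
        "needs_clarification": features["scene_volume_controls"] > 1,
    }
    second_layer = {                     # semantic understanding as element
        "scene_page": features["scene_page"],
        "scene_hits": features["scene_hits"],
        "global_intent": (features["global_operation"], features["global_object"]),
    }
    return first_layer, second_layer

layer1, layer2 = layer_request(features)
print(layer1)  # would match a "scene clarification" strategy
print(layer2)  # would match a "scene priority" strategy
```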
  • S32 includes:
  • S321 According to the dialogue state information, match the first task of the first layer of voice request information from a predetermined strategy and obtain the priority of the first task;
  • S322 According to the result information of the semantic understanding, match the second task of the second-level voice request information from the predetermined strategy and obtain the priority of the second task.
  • S321 and S322 may be implemented by the task determination module 106 .
  • The task determination module 106 is configured to match the first task for the first-layer voice request information from the predetermined strategy according to the dialogue state information and obtain the priority of the first task, and to match the second task for the second-layer voice request information from the predetermined strategy according to the result information of semantic understanding and obtain the priority of the second task.
  • The processor is configured to match the first task for the first-layer voice request information from the predetermined strategy according to the dialogue state information and obtain the priority of the first task, and to match the second task for the second-layer voice request information from the predetermined strategy according to the result information of semantic understanding and obtain the priority of the second task.
  • the voice request is reassembled into objects with different judgment conditions after being preprocessed, that is, the voice request information of the first layer and the voice request information of the second layer.
  • Different layers of voice request information have corresponding matching rule sets.
  • the matching rules for the first layer of voice request information include judging whether it is a scene task, predicting the direction of the dialogue state, the number of dialogue rounds, whether it is in a script, etc.
  • The matching rules for the second-layer voice request information include the scene ID, the policy tag of the intent mapping, and the like.
  • the corresponding tasks are matched in the predetermined strategy for the voice request information of different layers, and the priority corresponding to the task is obtained at the same time.
  • Some of the predetermined policies are relatively independent and determinable default policies. For example, in a scenario where a voice request hits two elements in the GUI, the user needs to clarify. During implementation, some situations other than the default strategy will also be encountered; to handle them, custom strategies can be supplemented. Custom strategies support hot updates, which is convenient for maintenance personnel to supplement at any time; hot updates can be applied dynamically without modifying the structure of the original strategies, which is faster, more convenient and easier to maintain. For example, in the dialogue state within the navigation script, the voice request is to zoom in on the map, but zooming in on the map is not an action of the script; in this case, the script task needs to be paused and the zoom-in command executed first, that is, the priority of the zoom-in map command needs to be increased, and a custom strategy can be defined to increase the priority of the zoom-in map command.
  • the predetermined policy includes a default policy and a custom policy, wherein the default policy part is used for the first-layer voice request information matching task, and the custom policy is used for the second-layer voice request information matching task.
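  • A sketch of matching each layer of request information against a predetermined policy list and fusing by the higher priority score might look like this; the policy names, conditions and scores are illustrative assumptions:

```python
# Sketch of matching each layer against a predetermined policy list and fusing
# by the higher priority score, as described above. Policy names, conditions
# and scores are illustrative assumptions only.
PREDETERMINED_POLICIES = [
    # (policy name, condition on the layered request info, priority score)
    ("scene_clarification", lambda info: info.get("needs_clarification"), 60),
    ("scene_priority",      lambda info: bool(info.get("scene_hits")),    80),
    ("pause_script_and_execute",
     lambda info: info.get("in_script") and info.get("single_round_command"), 90),
]

def match_policy(layer_info):
    """Return the first matching (policy, score) for one layer of request info."""
    for name, condition, score in PREDETERMINED_POLICIES:
        if condition(layer_info):
            return name, score
    return "no_match", 0

def fuse_by_priority(layer1_info, layer2_info):
    """Pick the task whose matched policy has the higher priority score."""
    return max(match_policy(layer1_info), match_policy(layer2_info),
               key=lambda match: match[1])

layer1 = {"needs_clarification": True, "in_script": False}
layer2 = {"scene_hits": ["navigation_volume_slider"]}
print(fuse_by_priority(layer1, layer2))  # -> ("scene_priority", 80)
```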
  • S33 includes:
  • S331 Compare the scores corresponding to the priority of the first task and the priority of the second task according to a predetermined strategy
  • S332 Determine a task with a higher score as a task corresponding to the task action according to the comparison result.
  • S331 and S332 may be implemented by the task determination module 106 .
  • In other words, the task determination module 106 is configured to compare the scores corresponding to the priority of the first task and the priority of the second task according to the predetermined strategy, and to determine the task with the higher score as the task corresponding to the task action according to the comparison result.
  • In some embodiments, the processor is configured to compare the scores corresponding to the priority of the first task and the priority of the second task according to the predetermined strategy, and to determine the task with the higher score as the task corresponding to the task action according to the comparison result.
  • Specifically, the list of predetermined strategies provides a priority score for each task, and the corresponding priority score is determined once the corresponding task has been matched.
  • In this embodiment, a highest-score-wins fusion strategy is adopted for two tasks with different priority scores; that is, the final task is the one with the higher priority score (a sketch follows).
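As a sketch, the highest-score fusion described above reduces to a single comparison of the two matched priorities; the scores used here are placeholders.

```python
def fuse(first_task, first_score, second_task, second_score):
    """Highest-score-wins fusion of the two layers' matched tasks."""
    return first_task if first_score >= second_score else second_task

# e.g. scene_execute (score 50) from the first layer vs. scene_priority (score 80)
# from the second layer -> the scene_priority task becomes the final task.
final_task = fuse("scene_execute", 50, "scene_priority", 80)
```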
  • Specifically, when a scene-to-global task is matched during task matching, a conversion process needs to be performed to convert the scene information into a global task. Its purpose is to construct the input format of the global semantics; if the task is not converted, the task corresponding to the voice request cannot be executed.
  • In one example, on the navigation page the voice request "search for charging piles" is input, and after the above layering, matching, and fusion strategies, the final task determined is a scene-to-global conversion.
  • Specifically, through scene semantic understanding the voice request hits the charging pile element in the scene interface, and the corresponding action is to search for charging piles within a predetermined range centered on the current position.
  • Through global semantic understanding it hits "search for charging piles", and the corresponding action is to search for charging piles along the current navigation route.
  • In this case, the relevant information shows that the dialogue is currently in the navigation scene and the charging pile element has been hit, so the scene-semantic search for nearby charging piles is converted into the global-semantic search for charging piles along the route, and the corresponding task is generated, as in the sketch below.
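A hypothetical sketch of that conversion step is given below; the field names and the slot structure are assumptions used only to illustrate rewriting a scene-semantic hit into a globally executable task.

```python
def scene_to_global(scene_hit, context):
    """Rewrite a scene-semantic hit into the input format of a global task.

    scene_hit: e.g. {"element": "charging_pile", "action": "search_nearby"}
    context:   e.g. {"scene": "navigation", "route_active": True}
    """
    if context.get("scene") == "navigation" and scene_hit.get("element") == "charging_pile":
        # "search charging piles near me" becomes "search charging piles along the route"
        return {"domain": "navigation", "intent": "search_along_route",
                "slots": {"poi_type": "charging_pile"}}
    return scene_hit   # no conversion needed: keep the scene-level task as it is

task = scene_to_global({"element": "charging_pile", "action": "search_nearby"},
                       {"scene": "navigation", "route_active": True})
```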
  • In another example, a new single-round task is generated while a script task is in progress.
  • In this case, suspension handling of the task is involved.
  • Specifically, when the existing task is a historical global script and the condition is a single-round instruction that does not trigger a page jump, the resulting task is to suspend the historical global script and execute the single-round instruction.
  • In addition, although the click events of some controls do jump to another page, they do not affect the historical global script task; in such cases the historical global script task can also be suspended (a sketch of this suspension rule follows).
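One reading of this suspension rule, as a hedged sketch with assumed field names, is a small decision function: a single-round instruction that either does not jump pages, or jumps pages without affecting the script, suspends the historical global script before it is executed.

```python
def run_single_round(script_stack, instruction, execute):
    """Suspend the historical global script (if allowed) and run a single-round instruction.

    instruction: assumed shape {"name": str, "jumps_page": bool, "affects_script": bool}
    execute:     callable that actually dispatches the instruction to the vehicle
    """
    can_suspend = (not instruction["jumps_page"]) or (not instruction["affects_script"])
    if script_stack and can_suspend:
        script_stack[-1]["state"] = "suspended"   # pause the historical global script task
    execute(instruction)                          # the single-round command runs first
    # The suspended script task can later be resumed from the dialogue state information.
```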
  • the current GUI is a navigation map interface
  • the user sends a voice request "navigate to location A" in the first round
  • The server determines the navigation intent according to the relevant semantic understanding and triggers entry into the navigation dialogue script.
  • the task determined in the first round is the global script task
  • The system feeds back "I found three places for you; which one would you like to go to?"
  • The GUI provides a list of the three places found on the navigation map page.
  • In the second round, the user wants to zoom in on the map before confirming the actual destination, and sends the voice request "zoom in on the map".
  • The process of determining the task action is as follows: the scene semantic understanding is that the map scale slider should slide up, and the scene information shows that the map page has a control for zooming in on the map.
  • After the executability check, the slider supports the slide-up operation, the number of executable tasks in the task tree is 1, and the dialogue action is determined to be execution (see the sketch below).
  • The dialogue state is to perform a slide-up of the map scale slider.
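The action choice in this example follows the general task-tree rule described earlier: one executable node means execute; several mean clarify or guide while the round threshold has not been reached; otherwise fail. A minimal sketch, with the zero-hit case filled in as an assumption:

```python
def decide_action(executable_hits, dialogue_round, max_rounds=3):
    """Map the number of executable task-tree hits to a dialogue action."""
    if executable_hits == 1:
        return "execute"                     # e.g. only the scale slider supports slide-up
    if executable_hits > 1:
        # Several candidates: clarify (or guide) while rounds remain, otherwise give up.
        return "clarify" if dialogue_round < max_rounds else "fail"
    return "guide"                           # assumed handling when nothing is executable

# Second round of the navigation example: exactly one executable node -> execute.
assert decide_action(executable_hits=1, dialogue_round=2) == "execute"
```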
  • the process of task determination is as follows.
  • On the one hand, from the parameters recorded in the dialogue state information it can be known that the current request uses scene semantics, this is the second round of the dialogue, the dialogue is not in a dialogue script, the direction of the dialogue state is execution, and so on; this part of the information can be used as the first-layer voice request information.
  • On the other hand, according to the different semantic understandings, the results of zooming in on the map and sliding up the scale slider in the map interface can be obtained; this part of the information can be used as the second-layer voice request information.
  • The two layers of voice request information correspond to a scene-execution task and a scene-priority task, respectively, and the priority of the scene-priority task is higher than that of the scene-execution task.
  • After the fusion processing, the task of the second-round voice request is the scene-priority task; that is, the operation of sliding up the scale slider on the navigation map is performed preferentially.
  • Meanwhile, since the current scene is navigation, the previous round was in the global navigation script, and the current scene semantics hit the scale slider and are executed, the global navigation script task is suspended at this point.
  • the embodiments of the present application also provide a computer-readable storage medium.
  • One or more non-volatile computer-readable storage media containing computer-executable instructions, when the computer-executable instructions are executed by one or more processors, cause the processors to execute the voice interaction method for a vehicle of any of the above embodiments.
  • the storage medium may be a magnetic disk, an optical disk, a read-only memory (Read-Only Memory, ROM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present application discloses a voice interaction method, including: performing fusion processing on the results of semantic understanding of a voice request; verifying the result of the fusion processing to update dialogue state information and thereby determine a task action corresponding to the voice request; and making a decision according to predetermined rules to determine a task corresponding to the task action. In the voice interaction method of the embodiments of the present application, during voice interaction between a user and a vehicle, different understanding results are fused during semantic understanding and combined with the dialogue history to determine the task and the action for executing the task, so that multi-dimensional information is used to accurately understand the user's real intent, making the voice interaction more intelligent and the user experience better. The present application also discloses a server and a storage medium.

Description

语音交互方法、服务器和计算机可读存储介质
相关申请的交叉引用
本申请要求于2020年09月18日提交中国专利局的申请号为CN202010986214.6、名称为“语音交互方法、服务器和计算机可读存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及语音识别技术领域,特别涉及一种用于车辆的语音交互方法、服务器和计算机可读存储介质。
背景技术
随着人工智能技术的发展,语音智能平台或者说语音助手因为能够在一定条件下识别用户的语音输入并生成相应的操作指令,为用户操作车辆设备,例如车辆的中控显示屏,提供了极大的便利,而被广泛应用。然而,相关技术在语音识别过程中,通常仅利用当前接收的语音信息,而缺少更多维度的信息,使得人机交互的质量不佳,用户体验较差。
发明内容
有鉴于此,本申请的实施例提供了一种用于车辆的语音交互方法、服务器和计算机可读存储介质。
本申请提供了一种用于车辆的语音交互方法,包括:
对语音请求进行语义理解的结果进行融合处理;
对所述融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作;
根据预定规则进行决策以确定对应所述任务动作的任务。
在某些实施方式中,所述对语音请求进行语义理解的结果进行融合处理包括:
对所述语音请求进行全局语义理解以得到第一理解结果;
对所述语音请求进行场景语义理解以得到第二理解结果。
在某些实施方式中,所述对所述语音请求进行全局语义理解以得到第一理解结果包括:
根据预存的模板对所述语音请求信息进行自然语言理解处理以得到第一理解结果。
在某些实施方式中,所述根据预存的模板对所述语音请求信息进行自然语言理解处理以得到第一理解结果包括:
基于预定模板对所述语音请求进行自然语言理解以生成第一理解子结果;
基于预定的分类模型对所述语音请求进行自然语言理解以生成第二理解子结果;
将所述第一理解子结果和所述第二理解子结果进行融合以得到所述第一理解结果。
在某些实施方式中,所述对所述语音请求进行场景语义理解以得到第二理解结果包括:
结合图形用户界面对所述语音请求信息进行自然语言理解处理以生成第二理解结果。
在某些实施方式中,所述结合图形用户界面对所述语音请求信息进行自然语言理解处理以生成第二理解结果包括:
基于图形用户界面对所述语音请求进行自然语言理解以生成第三理解子结果;
基于知识推理对所述语音请求进行自然语言理解以生成第四理解子结果;
将所述第三理解子结果和所述第四理解子结果进行融合以得到所述第二理解结果。
在某些实施方式中,所述对所述融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作包括:
对所述融合处理的结果进行可执行校验;
根据所述可执行校验的结果对所述对话状态信息进行更新。
在某些实施方式中,所述根据所述可执行校验的结果对所述对话状态信息进行更新包括:
根据所述校验的结果更新在所述对话状态信息的任务树中命中的节点;
根据所述节点的数量确定所述对应所述语音请求的任务的数量;
根据所述任务数量的确定对应所述语音请求的任务动作。
在某些实施方式中,所述根据预定规则进行决策以确定对应所述任务动作的任务包括:
将用于进行所述融合处理的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息;
根据预定的策略分别匹配所述第一层语音请求信息和所述第二层语音请求信息的策略并得到对应的优先级;
对所述第一层语音请求信息的优先级和所述第二层语音请求信息的优先级进行融合处理以确定所述任务。
在某些实施方式中,所述将用于进行语义理解的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息包括:
提取所述输入信息中的特征信息;
以所述对话状态信息为要素将所述特征信息划分为所述第一层语音请求信息;
以语义理解为要素将所述特征信息划分为所述第二层语音请求信息。
在某些实施方式中,所述根据预定的策略分别匹配所述第一层语音请求信息和所述第二层语音请求信息的策略并得到对应的优先级包括:
根据所述对话状态信息,自所述预定的策略匹配所述第一层语音请求信息的第一任务并获取所述第一任务的优先级;
根据所述语义理解的结果信息,自所述预定的策略匹配所述第二层语音请求信息的第二任务并获取所述第二任务的优先级。
在某些实施方式中,所述对所述第一层语音请求信息的优先级和所述第二层语音请求信息的优先级进行融合处理以确定所述任务包括:
根据所述预定的策略,比较所述第一任务的优先级和所述第二任务的优先级对应的分值;
根据比较结果将所述分值更高的任务确定为所述任务动作的任务。
本申请提供了一种服务器,包括:
语义融合模块,用于对语音请求进行语义理解的结果进行融合处理;
动作确定模块,用于根据所述语义理解的结果确定对应所述语音请求的任务动作;
任务确定模块,用于根据预定规则进行决策以确定对应所述任务动作的任务。
本申请提供了一种服务器,包括存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现所述的语音交互方法。
本申请提供了一种包含计算机可执行指令的非易失性计算机可读存储介质,当所述计算机程序被一个或多个处理器执行时,实现所述的语音交互方法。
本申请实施方式的语音交互方法、服务器和计算机可读存储介质中,在用户与车辆进行语音交互过程中,在语义理解时融合不同的理解结果,同时结合对话历史,来确定任务和任务的执行动作,利用多维度信息准确理解用户的真实意图,语音交互的智能性和用户体验更佳。
附图说明
本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:
图1是本申请某些实施方式的语音交互方法的流程示意图。
图2是本申请某些实施方式的服务器的模块示意图。
图3至图7是本申请某些实施方式的语音交互方法的流程示意图。
图8是本申请某些实施方式的语音交互方法的交互示意图。
图9至图10是本申请某些实施方式的语音交互方法的流程示意图。
图11是本申请某些实施方式的语音交互方法的交互示意图。
图12至图15是本申请某些实施方式的语音交互方法的流程示意图。
具体实施方式
下面详细描述本申请的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,旨在用于解释本申请,而不能理解为对本申请的限制。
请参阅图1,本申请提供了一种用于车辆的语音交互方法。包括:
S10:对语音请求进行语义理解的结果进行融合处理;
S20:对融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作;
S30:根据预定规则进行决策以确定对应的任务动作的任务。
本申请实施方式提供了一种服务器。服务器包括通信元件和处理器。通信元件用于接收车辆上传的语音请求。处理器用于对接收到的语音请求进行语义理解,及用于根据所述语义理解的结果和对话状态信息确定对应所述语音请求的任务动作,以及根据预定规则进行决策以确定对应所述任务动作的任务。
请参阅图2,本申请实施方式还提供了一种服务器100,本申请实施方式的语音交互方法可以由本申请实施方式的服务器100实现。
具体地,服务器100包括语义融合模块102、动作确定模块104和任务确定模块106。S10可以由语义融合模块102实现,S20可以由动作确定模块104实现,S30可以由任务确定模块106实现。或者说,语义融合模块102用于对语音请求进行语义理解的结果进行融合处理。动作确定模块104用于对融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作。任务确定模块106用于根据预定规则进行决策以确定对应的任务动作的任务。
本申请实施方式的用于车辆的语音交互方法和服务器100中,在用户与车辆进行交互过程中,在语义理解的同时,结合对话历史,来确定任务和任务的执行动作,利用多维度信息准确理解用户的真实意图,语音交互的智能性和用户体验更佳。
具体地,车辆包括显示区域、电声元件、通信元件和处理器等。车辆的显示区域可以包括仪表屏、车载显示区域屏幕以及车辆挡风玻璃上可以实现的抬头显示等。车辆上运行的车载系统使用图形用户界面(Graphical User Interface,GUI)为用户呈现展示的内容。显示区域包括诸多UI元素,不同的显示区域可以展示相同或不同的UI元素。其中,UI元素可以包括卡片对象、应用程序图标或界面、文件夹图标、多媒体文件图标以及用于进行交互可操作的控件等。其中,车辆的显示区域可以为用户提供控制车辆以及与车辆进行交互的便捷入口,在车载操作系统中添加语音助手,能够在一定条件下通过识别语音便捷地生成相应的控制指令,进一步地为用户与车辆的交互提供便利。
在本实施方式中,用户唤醒语音助手后,发送语音请求,例如用户通过当前图形用户界面进行车辆的相关控制而发出语音控制指令,例如在空调控制界面中下达相关的空调控制指令,再如在多媒体图形用户界面中下达多媒体播放的相关播放控制指令等。在诸如上述应用场景中,在将语音请求上传的同时,实时上传车辆系统当前正在运行的车载系统或者说应用程序的图形用户界面信息至服务器的场景信息数据库中。图形用户界面信息包括当前图形用户界面中元素的布局信息,如当前图形用户界面中包含的控件、每个控件的类型、位置,不同控件之间的关联关系等。后续在进行语义理解,特别是进行场景语义理解时,可以根据场景数据库中的相关进行自然语言理解。
具体地,图形用户界面信息也即是场景数据信息,以图形用户界面中的控件为单位,信息包括控件的控件标识、控件类型、文本描述、控件支持的操作方式、操作参数、多个控件在界面中的位置、布局关系等相关属性。
其中,控件标识可用于标识当前图形用户界面中的每个元素,每个元素具有唯一的标识。元素也即是当前图形用户界面中呈现的内容,以信息点卡片界面为例,其中的元素包括信息点名称、地址、收藏、搜周边、导航路线等。
文本描述,也即是对该元素在图形用户界面中的表达方式,例如,对于收藏控件,其文本描述为“收藏”。
控件类型也即是该元素在图形用户界面中的元素呈现类型,例如按钮、滑块、状态按钮、文本输入框、复选框、单选按钮、群组按钮、开关按钮、视图、组、对话框等。
控件支持的操作方式,即对应类型的控件可以进行的操作,例如按钮可支持的操作包括点击及选中,滑块可支持的操作包括滑动及选中,状态按钮可支持的操作包括点击、滑动、选中、单选及多选,文本输入框可支持的操作包括点击、选中及输入文本复选框可支持的操作包括点击、多选及选中,单选按钮可支持的操作包括点击、单选及选中,群组按钮可支持的操作包括点击、方位滑动和选中,开关按钮可支持的操作包括点击、打开、关闭和选中,视图可支持的操作包括点击、方位滑动、单选和选中、组可支持的操作包括点击和选中、对话框的操作包括点击和方位滑动。
操作参数对应操作方式的程度,例如,点击对应的操作参数为短按、长按,方位滑动对应的操作参数为大中小等。
多个控件在界面中的位置以及布局关系反映了相关元素在图形用户界面中的布局信息,相当于为服务器提供了视觉信息,使得服务器能够获取到用户所见的图形用户界面。
在进行连续多轮对话中,对语音请求的语义理解可融合多路信息,多路信息包括全局语义、场景语义等。其中全局语义是不结合GUI信息,将语音请求解析为结构化字段领域-意图-槽位进行语义理解,场景语义是指结合GUI信息,对语音请求进行语义理解,例如理解操作的控件序号、执行动作、控件取值等。
而对于每一路的语义理解又包括多种处理方式,例如采用的模型不同,输出的语义理解结果也不相同。本申请中的融合处理是指,将采用不同方式进行的全局语义理解的结果进行融合,将采用不同方式进行场景语义理解的结果进行融合。而究竟采用哪个语义理解的融合结果来确定任务需要进一步的决策。
服务器中还具有一对话状态数据库,对话状态数据库中记录有多轮对话的历史数据,例如包括历史对话的上下文内容、对话轮数、任务树形图等信息。对话状态数据库中的数据会在每一轮对话的过程中进行更新。其中,更新的依据可包括但不限于场景语义理解的结果、上一时刻或者说历史对话状态信息等。
根据更新后的对话状态信息可确定当前语音请求的任务动作,所述任务动作是指语音请求对应的任务的走向,例如澄清、引导、执行或失败等,而非具体的任务。
具体的任务则根据预定规则对话状态信息、场景数据信息等多路输入进行决策确定,可以理解地,不同的语义理解的融合结果作为输入信息可能对应不同任务,在这种情况下,需要相关的规则对不同任务的优先级进行评价,从而确定当前语音请求的任务,并根据任务动作执行任务。
在一个示例中,首轮,用户发出语音请求“调高”,根据语义理解,全局语义理解的融合结果命中了调高的操作,场景语义理解的融合结果命中了多个可调高的控件等结果。将不同的语义 理解的结果输入对话状态信息后,根据场景语义理解的结果由于命中了多个可调高控件,当前为对话的首轮,确定当前轮次的任务动作为引导。结合语义理解以及场景数据确定任务是引导任务。系统可反馈“您想调高什么,试着对我说调高温度”。
次轮,用户发出语音请求“调高亮度”,根据语义理解,场景语义理解的融合结果命中两个亮度可调高的滑块等结果。全局语义理解的融合结果命中了调高的操作动作以及亮度这个对象。将不同的语义理解的融合结果输入对话状态信息后,由于命中了两个可调高的滑块包括仪表屏量亮度滑块和车载显示区域屏幕亮度滑块,当前为对话次轮,确定当前轮次的任务为澄清。结合语义理解以及场景数据确定任务是澄清任务。任务图形用户界面提供可调节对象的选择列表,第一个为仪表亮度,第二个为车载显示区域屏幕亮度。系统可反馈“找到仪表亮度和大屏亮度,要调高哪个”。
第三轮,用户发出语音请求“第一个”,根据语义理解,全局语义理解的融合结果命中了序号1,场景语义理解的融合结果命中了一个仪表亮度可调高。将不同的语义理解的融合结果输入对话状态信息后,由于命中了唯一的可调高的滑块为仪表屏量亮度当前为对话第三轮,确定当前轮次的任务为执行,结合语义理解以及场景数据确定任务是执行仪表屏亮度滑块右滑的任务,系统反馈“仪表亮度已调高”。
请参阅图3,在某些实施方式中,S10包括:
S11:对语音请求进行全局语义理解以得到第一理解结果;
S12:对语音请求进行场景语义理解以得到第二理解结果。
在某些实施方式中,S11-S13可以由语义融合模块102实现,或者说,语义融合模块102用于对语音请求进行全局语义理解以得到第一理解结果,以及用于对语音请求进行场景语义理解以得到第二理解结果。
在某些实施方式中,处理器用于对语音请求进行全局语义理解以得到第一理解结果,以及用于对语音请求进行场景语义理解以得到第二理解结果。
具体地,本实施方式中,在对语音请求进行语义理解的过程中,分为两路执行。具体而言,其中一路为不结合GUI信息而根据预存的模板对语音请求进行自然语言理解处理。另一路为结合GUI信息对语音请求进行自然语言理解处理。基于上述两种策略对语音请求分别进行语言理解处理从而得到第一理解结果和第二理解结果。
可以理解地,基于预存的模板对语音输入信息进行自然语言理解能够在一定程度上保证输入信息泛化的召回,结合GUI信息可以保证语音交互与图形用户界面的一致从而提高精确度。如此,采用不同策略进行语言理解处理,实现了不同策略之间的优势互补,兼顾了召回率与准确率从而使得语音识别的效果更好。
在一个示例中,用户在车窗调节的图形用户界面,并发出打开车窗相关的语音请求,语音请求和车窗控制页面的GUI信息共同上传至服务器,服务器根据预存的模板对语音请求进行自然语言理解处理得到第一理解结果,结合GUI信息进行语言理解处理得到第二理解结果,例如,用户发出的语音请求为“打开”,根据预定的模板可能召回多个结果,第一理解结果命中全局操作打开,召回所有可开关的对象如车窗、车门、车灯等。而结合当前GUI信息,即车窗控制页面的GUI信息,可以返回打开车窗的结果的第二理解结果。
请参阅图4,在某些实施方式中,S11包括:
S111:根据预存的模板对语音请求信息进行自然语言理解处理以得到第一理解结果。
在某些实施方式中,S111可以由语义融合模块102实现,或者说,语义融合模块102用于根据预存的模板对语音请求信息进行自然语言理解处理以得到第一理解结果。
在某些实施方式中,处理器用于根据预存的模板对语音请求信息进行自然语言理解处理以得到第一理解结果。
具体地,基于预存的模板对语音输入信息进行自然语言理解,不结合GUI信息,单纯对语音请求进行语义理解,能够在一定程度上对于较为泛化的语音请求,特别是不能精准命中GUI元素的语音请求进行语义理解,保证了语音请求的召回。
请参阅图5,在这样的实施方式中,S111包括:
S1111:基于预定模板对语音请求进行自然语言理解以生成第一理解子结果;
S1112:基于预定的分类模型对语音请求进行自然语言理解以生成第二理解子结果;
S1113:将第一理解子结果和第二理解子结果进行融合以得到第一理解结果。
在某些实施方式中,S1111-S1113可以由语义融合模块102实现。或者说,语义融合模块102用于基于预定模板对语音请求进行自然语言理解以生成第一理解子结果,及用于基于预定的分类模型对语音请求进行自然语言理解以生成第二理解子结果,以及用于将第一理解子结果和第二理解子结果进行融合以得到第一理解结果。
在某些实施方式中,处理器用于基于预定模板对语音请求进行自然语言理解以生成第一理解子结果,及用于基于预定的分类模型对语音请求进行自然语言理解以生成第二理解子结果,以及用于将第一理解子结果和第二理解子结果进行融合以得到第一理解结果。
具体地,采用预存模板对语音请求进行语言理解处理分为两个组别。可以理解地,不同模板的处理侧重不同,有的模板侧重于理解结果的准确性,有的模板侧重于理解结果的召回率。而侧 重性一致的不同模板对于不同领域的业务也各有互补。在本实施例中,其中一个分组以精度优先为侧重,其中的模板可包括AC自动机模板、句法树模板以及正则表达式模板等。另一个分组以召回优先为侧重,其中的模型可包括BERT分类模型、LSTM分类模型以及GBDT分类模型等。
在实际操作中,对于每一条语音请求,经过上述的预定模板进行自然语言理解处理,可以分别得到对应的第一理解子结果。相对应地,该条语音请求,还会经过上述预定的分类模型从而生成对应的第二理解子结果,进而将第一理解子结果和第二理解子结果经过相应的融合策略,实现对第一理解子结果和第二理解子结果的融合处理。
融合策略包括通用融合策略和自定义融合策略,通用融合策略适用于所有业务,自定义融合策略针对一些特定业务设定特定策略。
具体而言,通用融合策略,根据各个理解子结果的置信度调整相应理解子结果的权重和优先级,然后对各个理解子结果加权投票进行融合处理。
可以理解地,不同的语言理解擅长的领域有所不同,例如导航类指令的理解,跟音乐类指令理解可能会有不一样的融合策略。在自定义融合策略中,可考虑是否命中句式模板、上下文是否来自同一个领域,来调整相关理解子结果的优先级,在自定义融合策略中还可以直接选用多个理解子结果中的某一个作为最终的融合结果。
自定义融合策略支持热更新,服务器的维护人员可以通过对输入信息的自然语言理解的大数据信息,不断调整融合策略,增添新的特定业务场景。通过这种分层机制,保证了多个理解子结果的融合可以有足够的弹性,既有通用性,又可以适配特殊业务场景需要。
在一个示例中,例如对于一条语音请求“北京北京”,经过上述的预定模板和分类模型得到如播放音乐、导航到北京、查询地点等结果,上述结果分别对应的权重为70%、10%、20%,经过加权投票可以得到融合后的结果为用户的意图是播放音乐。
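The "北京北京" example above corresponds to a straightforward weighted vote over the candidate interpretations; a minimal sketch follows, with the weights taken from the example and everything else assumed.

```python
from collections import defaultdict

def weighted_vote(candidates):
    """Fuse understanding sub-results by weighted voting and return the winner.

    candidates: (interpretation, weight) pairs produced by the different
    templates and classification models.
    """
    scores = defaultdict(float)
    for interpretation, weight in candidates:
        scores[interpretation] += weight
    return max(scores, key=scores.get)

# "北京北京": play music (0.7) vs. navigate to Beijing (0.1) vs. search place (0.2)
assert weighted_vote([("play_music", 0.7),
                      ("navigate_to_beijing", 0.1),
                      ("search_place", 0.2)]) == "play_music"
```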
请参阅图6,在某些实施方式中,S12包括:
S121:结合图形用户界面对语音请求信息进行自然语言理解处理以生成第二理解结果。
在某些实施方式中,S121可以由语义融合模块102实现,或者说,语义融合模块102用于结合图形用户界面对语音请求信息进行自然语言理解处理以生成第二理解结果。
在某些实施方式中,处理器用于结合图形用户界面对语音请求信息进行自然语言理解处理以生成第二理解结果。
具体地,结合GUI信息可以保证语音交互与图形用户界面的一致性从而提高精确度,对于与GUI信息匹配度较高的语音请求可以根据结合GUI信息的语义理解来映射用户意图。
请参阅图7和图8,在某些实施方式中,S121包括:
S1211:基于图形用户界面对语音请求进行自然语言理解以生成第三理解子结果;
S1212:基于知识推理对语音请求进行语言理解以生成第四理解子结果;
S1213:将第三理解子结果和第四理解子结果进行融合以得到第二理解结果。
在某些实施方式中,S1211-S1213可以由语义融合模块102实现。或者说,语义融合模块102用于基于图形用户界面对语音请求进行自然语言理解以生成第三理解子结果,及用于基于知识推理对语音请求进行语言理解以生成第四理解子结果,以及用于将第三理解子结果和第四理解子结果进行融合以得到第二理解结果。
在某些实施方式中,处理器用于基于图形用户界面对语音请求进行自然语言理解以生成第三理解子结果,及用于基于知识推理对语音请求进行语言理解以生成第四理解子结果,以及用于将第三理解子结果和第四理解子结果进行融合以得到第二理解结果。
相类似地,在结合GUI信息对语音请求进行自然语言理解处理时,同样基于精度和召回率的不同侧重考虑,设置了不同的语言处理分组,从而通过不同的处理路径实现对同一语音请求进行语言理解处理,进而对基于不同策略得到的结果进行融合,从而得到第二语音理解结果。
具体地,本实施方式中,以精度优先为侧重可以是基于GUI信息对语音输入信息进行语言理解,例如可采用包括GUI控件匹配、精准匹配、文本匹配、动词匹配、模糊匹配、拼音匹配等处理方式。以召回优先为侧重可以是基于推理知识对语音请求进行语言理解,例如,可采用基于动作词搭配推理,基于实体同义词推理,基于抽象归类推理等处理方式。
在实际操作中,对于每一条语音请求,基于GUI进行自然语言理解处理,可以分别得到对应的第三理解子结果。相对应地,该条语音请求,还会经过知识推理从而生成对应的第四理解子结果,进而将第三理解子结果和第四理解子结果经过相应的融合策略,实现对第三理解子结果和第四理解子结果的融合处理。
融合策略包括通用融合策略和自定义融合策略,通用融合策略适用于所有业务,自定义融合策略针对一些特定业务设定特定策略。
具体而言,通用融合策略,可基于精度优先兼顾召回的原则,对各个理解子结果采用打分、投票等机制,如少数服从多数,加权投票机制,胜者树机制,以及机器学习模型融合中的相关策略例如Boosting和Bagging等,进行融合。
在一个示例中,融合策略可以是对经自然语言理解处理后输入信息的意图清晰度、字段的完整度进行打分,根据打分权重调整各个理解子结果在融合投票过程中的优先级。
可以理解地,不同的语言理解擅长的领域有所不同,例如导航类指令的理解,跟音乐类指令 理解可能会有不一样的融合策略。在自定义融合策略中,可考虑动作词与实体搭配度、关键实体词是否精准匹配等条件,来调整相关理解子结果的优先级,在自定义融合策略中还可以直接选用多个理解子结果中的某一个作为最终的融合结果。
自定义融合策略支持热更新,服务器的维护人员可以通过对输入信息的自然语言理解的大数据信息,不断调整融合策略,增添新的特定业务场景。通过这种分层机制,保证了多个理解子结果的融合可以有足够的弹性,既有通用性,又可以适配特殊业务场景需要。
在一个示例中,例如对于一GUI中的控件“导航音量”,该控件支持的操作为调大和调小。结合GUI信息,采用不同匹配处理方式,对于不同表述的语音方式都可以最终匹配到该控件以及相关操作。例如语音请求“调大导航音量”可以通过精准匹配处理匹配到该控件以及动作。语音请求“导航音量”可以通过文本匹配处理匹配到该控件。语音请求“调大一些”可以通过动作词匹配处理匹配到控件的操作。语音请求“导航声音”可以通过模糊匹配处理匹配到该控件。语音请求“dao hang音量”可以通过拼音匹配处理匹配该控件。如此,上述的语言理解处理方式各自胜任一部分能力,最终结合起来可以具有较好的效果。
动作词搭配推理,是根据语音请求中与相关动词搭配的程度进行匹配结果的召回,例如语音请求“调小”,根据匹配程度,与调小可合理搭配的主体可以是灯光、音量等,而车门等搭配度较低,不进行召回。
实体同义词推理,是将语音请求的实体词进行同义扩展,使得语音请求得以泛化,从而能够召回更多的结果。例如语音输入信息“主驾车窗”可扩展为“左前车窗”。
抽象归类推理,是将语音请求中的实体词进行上位归类,使得语音输入信息得以泛化,从而能够召回更多的结果。例如语音请求“近光灯”可扩展为“车灯”。
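The three recall-oriented reasoning steps just described (action-word collocation, entity synonym expansion, abstract categorization) can be pictured as small vocabulary lookups that widen the candidate set before matching; the tiny vocabularies below are illustrative assumptions built from the examples in the text.

```python
# Hypothetical vocabularies for the recall-oriented knowledge reasoning steps.
COLLOCATIONS = {"调小": {"灯光", "音量"}}     # action word -> objects it plausibly modifies
SYNONYMS     = {"主驾车窗": {"左前车窗"}}     # entity -> synonymous expressions
HYPERNYMS    = {"近光灯": "车灯"}             # entity -> broader category

def recall_candidates(action=None, entity=None):
    """Widen the set of candidate objects used to recall matching GUI controls."""
    candidates = set()
    if entity:
        candidates.add(entity)
        candidates |= SYNONYMS.get(entity, set())      # entity synonym reasoning
        if entity in HYPERNYMS:
            candidates.add(HYPERNYMS[entity])          # abstract categorization reasoning
    if action:
        candidates |= COLLOCATIONS.get(action, set())  # action-word collocation reasoning
    return candidates

# "调小" alone recalls the controls it can reasonably modify (lights, volume).
print(recall_candidates(action="调小"))
```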
需要说明地,不同分组内的处理方式不限于本申请公开的方式,能够实现所需目的的自然语言理解处理方式都可以根据需求添加。
请参阅图9,在某些实施方式中,S20包括:
S21:对融合处理的结果进行可执行校验;
S22:根据可执行校验的结果对对话状态信息进行更新。
在某些实施方式中,S21、S22可以由动作确定模块104实现。或者说,动作确定模块104可用于对语义理解的结果进行可执行校验,以及用于根据可执行校验的结果对对话状态信息进行更新。
在某些实施方式中,处理器用于对语义理解的结果进行可执行校验,以及用于根据可执行校验的结果对对话状态信息进行更新。
具体地,在语音交互过程中,将对语音请求的语义理解结果,输入到任务追踪器中,任务追踪器包括从对话状态数据库中继承存储了对话状态信息的历史任务追踪器。根据输入更新任务追踪器的状态,从而更新当前语音请求的对话状态信息。
状态更新条件的数据源包括:语义理解的结果、对话状态信息,如当前对话剧本状态、上下文中对话策略选择结果。
其中,对话剧本是针对某些具体地包含多流程的任务有向图。一个对话剧本可以看作是一个大的独立任务包含的若干动作的集合。设置对话剧本方便对对话进行管理。例如,一个导航任务包括搜索POI、算路、选路线、导航等多个子任务。在语音请求命中了导航任务后,即转入对话剧本,后续对话流程在导航任务的对话剧本中流转,具有更强的指向性。当前对话本状态是指当前的语音请求时否在当前剧本中、是否在新剧本中或有无剧本。
上下文中对话策略选择结果包括历史对话是采用了全局语义或场景语义。
根据状态更新的数据源可确定语音请求的任务动作。也即是,对话当前的任务状态所触发的动作,包含执行、引导、澄清、确认、取消、结束等。例如,如果任务动作是执行则将下发执行命令。如果任务动作是引导、澄清或确认,则生成并播放引导、澄清、确认的话术语音。
在交互过程中,由于请求中信息的不足、冗余或用户对历史对话的反悔,需要引导、澄清、确认等多轮交互。在多轮交互中,用户关注的信息往往不再是整个图形交互页面,而是关注上文提及的相关控件,也即是场景数据的一部分。因此,继承对话状态信息,也可将部分页面屏蔽后作为场景信息,传给下一轮的语义理解。处理过程中,可利用上下文管理器可实现存入、读取任务追踪器。只保留相关控件。例如加载对话状态信息,若存在多轮任务状态,则对比历史场景中的可执行、可引导澄清控件,只保留与当前语义理解命中的重复的控件,若没有重复,则合并保留两部分结果。
可执行校验是对语义理解结果中命中的控件元素进行的范式校验,判断命中的控件元素是否可执行。例如语义理解结果命中了按钮,以及其的点击操作,点击操作的执行条件为控件和操作,按钮满足检验条件,可执行。在处理过程中,依次对语义理解结果输出的控件的返回结果进行校验,从而得出控件是否可执行、是否可引导澄清。根据可执行校验的结果更新对话状态信息进行更新,以用于为确认任务动作提供依据。例如,可执行校验后确认有两个可执行的控件,那么对话状态信息的对话动作就可能更新为澄清或引导。而如果可执行校验后确认有一个可执行的控件,那么对话状态信息的对话动作就可能更新为执行。
请参阅图10,在某些实施方式中,S22包括:
S221:根据校验的结果更新在对话状态信息的任务树中命中的节点;
S222:根据节点的数量确定对应语音请求的任务的数量;
S223:根据任务数量的确定对应语音请求的任务动作。
在某些实施方式中,S221-S223可以由动作确定模块104实现。或者说,动作确定模块104可用于根据校验的结果更新在对话状态信息的任务树中命中的节点,及用于根据节点的数量确定对应语音请求的任务的数量,以及用于根据任务数量的确定对应语音请求的任务动作。
在某些实施方式中,处理器用于根据校验的结果更新在对话状态信息的任务树中命中的节点,及用于根据节点的数量确定对应语音请求的任务的数量,以及用于根据任务数量的确定对应语音请求的任务动作。
具体地,对话状态中的任务树是一种用于表征场景中控件布局关系的类树形图。可以理解地,GUI中可能包含同时运行的一个或多个应用程序,一个或多个应用程序由一个或多个控件构成。因此可能存在多个组织架构的控件布局。任务树将这些控件以类树形图的形式构建起来,每一个节点代表一个场景数据中控件。
任务树中的根节点为当前图形用户界面的视图,根节点的可执行任务数量代表经可执行校验后当前场景中中的可执行的控件数量。在执行过程中,经可执行校验命中的节点处,对其可执行数量进行计数,进而对任务树由下至上,累加各个节点的可执行数量,直至根节点处。
任务动作是对可执行数量的判断,如果可执行数量为1,那么任务动作为执行,如果可执行大于1,需要根据对话状态信息进行进一步的判断,例如当前对话轮数小于轮数阈值,则任务动作可能是澄清或引导,如果当前对话轮数大于或等于轮数阈值,则任务动作是失败。
其中,引导是指以示范性的反馈信息对用户形成教导,引导用户以反馈信息的样式进行语音交互,从而可以输入语义更加明确的语音请求。例如,“请问您是要打开左前门车窗吗”,“请以打开左前门车窗的表述对我重新下达指令”等内容的反馈。
澄清是指以询问的方式使得用户在后续对话中能够对首轮对话中不清楚的请求进行解释澄清,从而可以明确用户的语义。例如“请问您要打开哪个车窗”,“请问您打开多大高度”等内容的反馈。
请参阅图11,在一个示例中,用户通过语音与车辆进行交互,发出“打开车窗”语音请求,经语义理解后命中五个按钮,“打开左前门车窗”、“打开左后门车窗”、“打开右前门车窗”、“打开右后门车窗”、“打开天窗”以及其的点击操作。经可执行校验后,点击操作的可执行条件为控件和操作,五个按钮满足均可执行的条件。经统计,任务树中的根节点可执行任务数量为5,不可直接执行。因此会生成用于引导或澄清的反馈信息下发车辆,由车辆播报。
而任务动作是引导或澄清则根据对话状态信息的具体情况确定。例如当前为首轮对话,由于轮次较少,未达到3轮的上限,可确定任务动作为澄清。又如,当前为次轮对话,根据该轮次的语音请求,仍然无法唯一确定可执行任务,那么就需要引导用户在第三轮对话中准确表达,否则可能会导致在第三轮对话后交互结束,此时,可确定任务动作为引导。
请参阅图12,在某些实施方式中,S30包括:
S31:将用于进行融合处理的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息;
S32:根据预定的策略分别匹配第一层语音请求信息和第二层语音请求信息的策略并得到对应的优先级;
S33:对第一层语音请求信息的优先级和第二层语音请求信息的优先级进行融合处理以确定对应任务动作的任务。
在某些实施方式中,S31-S33可以由任务确定模块106实现。或者说任务确定模块用于将用于进行语义理解的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息,及用于根据预定的策略分别匹配第一层语音请求信息和第二层语音请求信息的策略并得到对应的优先级,以及用于对第一层语音请求信息的优先级和第二层语音请求信息的优先级进行融合处理以确定对应任务动作的任务。
在某些实施方式中,处理器用于将用于进行语义理解的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息,及用于根据预定的策略分别匹配第一层语音请求信息和第二层语音请求信息的策略并得到对应的优先级,以及用于对第一层语音请求信息的优先级和第二层语音请求信息的优先级进行融合处理以确定对应任务动作的任务。
具体地,在确定任务动作后,需要进一步确认对应的任务。本实施方式中,语音请求对应的任务融合多路信息来源生成。多路信息来源包括全局语义理解的融合结果、场景语义理解的融合结果、场景数据、上下文信息等。在实际处理过程中,将包含上述信息来源的任务追踪器和场景数据作为输入数据经过预定的策略处理,确定最终的任务。
本实施方式中,将多路输入信息进行分层处理,分别判断分层后的信息对应的执行策略的优先级,并将各分层的执行策略的优先级进行融合,得到最终的任务。如此,通过分层的方式降低任务确定处理的复杂度,使得在多轮语音交互过程中的任务具有明确的生成确定策略。
预定的策略可以是预存储于服务器的存储器中的策略列表,其中包括多个执行策略以及相对应的优先级评分。不同分层信息可能匹配到不同的执行策略,而不同的执行策略又对应不同的优 先级评分,在优先级不同的情况下,进行融合处理,得到最终的语音请求的执行策略。
其中融合处理的策略可以是取高分策略,分数越高表明其命中的策略优先级越高,优先执行更贴合用户当前轮次的语音请求的意图。当然也可以是根据优先级和权重进行综合考量。
请参阅图13,在某些实施方式中,S31包括:
S311:提取输入信息中的特征信息;
S312:以对话状态信息为要素将特征信息划分为第一层语音请求信息;
S313:以语义理解为要素将特征信息划分为第二层语音请求信息。
在某些实施方式中,S311-S313可以由任务确定模块106实现。或者说,任务确定模块106用于提取输入信息中的特征信息,及用于以对话状态信息为要素将特征信息划分为第一层语音请求信息,以及用于以语义理解为要素将特征信息划分为第二层语音请求信息。
在某些实施方式中,处理器用于提取输入信息中的特征信息,及用于以对话状态信息为要素将特征信息划分为第一层语音请求信息,以及用于以语义理解为要素将特征信息划分为第二层语音请求信息。
具体地,特征信息是语音请求在进行自然语言理解等处理后得到的信息,包括但不限于经过全局语义理解后命中的领域名称、意图、意图ID等,经过场景语义理解后命中的场景ID、元素ID、结合场景数据得到该场景ID的属性信息等,对话状态跟踪器中记录的多轮对话的时域ID、是否在剧本中等状态信息等。
提取特征信息也即是进行策略匹配前预处理过程,将上述特征信息转换为策略匹配引擎能够接受的属性标签,或者说能够进行任务匹配的判断条件,每个分层可能具有多个属性值,将多个属性值组装成一个对象,也即是该层对应的语音请求信息。
预处理过程包括通过场景数据、场景语义理解判断命中的场景元素并转化为可用于分层的属性标签,及通过全局语义理解转化为场景语义理解并转化为可用于分层的属性标签,以及获取任务追踪器中的相关信息等。其中,全局语义理解转化为场景语义理解这种情况发生在场景语义理解未召回且在预定义的映射库里全局语义能够匹配到场景语义的情况下进行,如此,可以完善业务,对场景信息进行有效的补充。预定的分层规则也即是将上述组装成的对象依照一定的规则进行划分。或者说依照策略匹配规则将组合成的对象划分为第一层和第二层。
需要说明地,所述的第一层和第二层并不存在层级或层次上的包含、递进等关系,可以看作是两个并列对象进行不同方式的任务匹配。
任务追踪器信息包括是否开启场景任务、当前对话进行的轮数、对话是否在剧本中、是否有全局语义、是否场景语义等状态信息。在一个示例中,语音请求经过语义理解后,提取特征信息,并依据任务追踪器信息为要素将部分特征信息组合成第一层语音请求信息,从而可以通过对话状态信息匹配任务策略。
相类似地,语义理解信息包括场景语义理解的融合结果、全局语义理解的融合结果和场景数据信息。
第二层语音请求信息是将特征信息以语义理解信息为要素将部分特征信息组合得到,从而可以通过语义理解信息匹配任务策略。
在一个示例中,在导航页面中,语音请求“调高音量”,特征信息可包括:经全局语义理解,操作为调节,对象为音量;经场景语义理解,命中导航音量滑块,场景数据有导航地图页面有音量调高相关控件,对话状态跟踪器中的信息包括对话为第N轮,不在剧本中,场景任务命中多个音量调高控件,场景需澄清。根据相关规则,将特征信息划分为第一层语音请求信息包括对话为第N轮,在剧本中,调高命中多个场景需澄清。第二层语音请求信息包括场景页面为导航,场景语义命中导航音量滑块。进而,第一层语音请求信息匹配的策略为“场景澄清”,第二层语音请求信息匹配的策略为“场景优先”。
请参阅图14,在某些实施方式中,S32包括:
S321:根据对话状态信息,自预定的策略匹配第一层语音请求信息的第一任务并获取第一任务的优先级;
S322:根据语义理解的结果信息,自预定的策略匹配第二层语音请求信息的第二任务并获取第二任务的优先级。
在某些实施方式中,S321、S322可以由任务确定模块106实现。或者说,任务确定模块106用于根据对话状态信息,自预定的策略匹配第一层语音请求信息的第一任务并获取第一任务的优先级,以及用于根据语义理解的结果信息,自预定的策略匹配第二层语音请求信息的第二任务并获取第二任务的优先级。
在某些实施方式中,处理器用于根据对话状态信息,自预定的策略匹配第一层语音请求信息的第一任务并获取第一任务的优先级,以及用于根据语义理解的结果信息,自预定的策略匹配第二层语音请求信息的第二任务并获取第二任务的优先级。
具体地,语音请求在经过预处理后重新组装成为具有不同判断条件的对象,即第一层语音请求信息和第二层语音请求信息。不同层语音请求信息具有对应的匹配规则集,在一个示例中,例如对于第一层语音请求信息的匹配规则包括判断是否是场景任务、对话状态的走向预测、对话轮数,是否在剧本中等。第二语音请求信息的匹配规则包括场景id、意图映射的策略标签等。
根据这些匹配规则去为不同分层的语音请求信息在预定的策略中匹配相对应的任务,同时获取该任务对应的优先级。
预定的策略中有部分是相对独立且可以确定的默认策略,例如,在场景中语音请求命中GUI中两个元素,在这种情况下,需要用户进行澄清。在实施过程中,也会遇到一些默认策略外的情况,面对这些情况,通过自定义策略进行补充,自定义策略可以是热更新的,方便维护人员随时进行补充,热更新可以随时进行动态添加,无需对原有策略的架构进行修改,更加快捷方便并且易于维护。例如,在对话状态在导航剧本中,语音请求是放大地图,而放大地图并不是剧本的动作,此时,需要将剧本任务暂停,优先执行放大地图命令,也即是说,在这种情况下,需要调高放大地图命令的优先级,可以自定义一个策略,来调高放大地图命令的优先级。
也即是说,预定的策略包括默认策略和自定义策略两部分,其中默认策略部分用于第一层语音请求信息匹配任务,自定义策略用于第二层语音请求信息匹配任务。
请参阅图15,在某些实施方式中,S33包括:
S331:根据预定的策略,比较第一任务的优先级和第二任务的优先级对应的分值;
S332:根据比较结果将分值更高的任务确定为对应任务动作的任务。
在某些实施方式中,S331和S332可以由任务确定模块106实现。或者说,任务模块106用于根据预定的策略,比较第一任务的优先级和第二任务的优先级对应的分值以及根据比较结果将分值更高的任务确定为对应任务动作的任务。
在某些实施方式中,处理器用于根据预定的策略,比较第一任务的优先级和第二任务的优先级对应的分值以及根据比较结果将分值更高的任务确定为对应任务动作的任务。
具体地,预定策略的列表中对应每个任务提供一个优先级分值,在匹配到相应的任务后确定相应的优先级分值。本实施方式中,对于具有不同优先级分值的两个任务采用高分优先的融合策略,也即是最终的任务为优先级分值更高的一个。
具体地,当在任务匹配的过程中匹配到场景转全局的任务时,需要进行转换处理,从而将场景信息转换为全局任务。其目的在于构建全局语义的输入格式。如果不进行任务的转化,对应语音请求的任务将无法执行。
在一个示例中,在导航页面中,输入语音请求“搜索充电桩”,经过上述分层、匹配、融合的策略后最终确定的任务为场景转全局。具体而言,该语音请求经过场景语义理解命中场景界面中的充电桩元素,对应的动作是以当前位置为中心搜索预定范围内的充电桩。经过全局语义理解命中搜索充电桩,对应的动作是搜索当前导航路线中沿途的充电桩。这种情况下,根据先关信息可知当前处于导航场景下又命中了充电桩元素,那么就将场景语义搜索附近的充电桩转换为全局语义搜索沿途的充电桩,并生成相应的任务。
又一个示例中,在一个剧本任务中,生成一个新的单轮任务。在这种情况下,涉及任务的挂起处理。具体而言,任务为历史全局剧本,条件为不触发页面跳转的单轮指令,任务为挂起历史全局剧本,执行单轮指令。此外,对于某些控件的点击事件虽然跳转了页面,但不会影响历史全局剧本任务,也可挂起历史全局剧本任务。
在一个示例中,当前图形用户界面为导航地图界面,用户在首轮发出语音请求“导航去A地点”,根据相关的语义理解确定是导航的意图,并且触发进入导航对话剧本。首轮确定的任务是全局剧本任务,系统反馈“为您找到三个地点,请问去哪一个?”,图形用户界面中会在导航地图页面上提供找到的三个地点的列表。
次轮,用户希望将地图放大后再确认实际的目的地,发出语音请求“放大地图”。任务动作确定的过程如下:场景语义理解为地图比例尺上滑,场景信息地图页面有放大地图的按钮。经可执行校验后,滑块支持上滑操作,任务树可执行数量为1,确定对话动作为执行。对话状态为执行地图比例尺滑块上滑。
任务确定的过程如下,一方面,由对话状信息中记录的参数可获知当前为场景语义、对话第二轮,不在对话剧本中,对话状态走向是执行等,该部分信息可作为第一层语音请求信息。另一方面根据不同语义理解可得到对地图进行放大操作、对地图界面中的比例尺滑块进行放大操作的结果,该部分信息可作为第二层语音请求信息。两层语音请求信息分别对应场景执行的任务以及场景优先的任务。其中,场景优先的任务的优先级高于场景执行的任务,经过融合处理后,次轮的语音请求的任务为场景优先,即优先执行对导航地图中的比例尺滑块上滑的操作。同时,由于当前场景为导航、且上一轮为全局导航剧本,当前场景语义命中比例尺滑块且执行,此时,将挂起全局导航剧本任务。
本申请实施方式还提供了一种计算机可读存储介质。一个或多个包含计算机可执行指令的非易失性计算机可读存储介质,当计算机可执行指令被一个或多个处理器执行时,使得处理器执行上述任一实施方式的车辆的语音交互方法。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,程序可存储于一非易失性计算机可读存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等。
以上实施例仅表达了本申请的几种实施方式,其描述较为具体和详细,但并不能因此而理解 为对本申请专利范围的限制。应当指出的是,对于本领域的普通技术人员来说,在不脱离本申请构思的前提下,还可以做出若干变形和改进,这些都属于本申请的保护范围。因此,本申请专利的保护范围应以所附权利要求为准。

Claims (15)

  1. 一种用于车辆的语音交互方法,其特征在于,包括:
    对语音请求进行语义理解的结果进行融合处理;
    对所述融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作;
    根据预定规则进行决策以确定对应所述任务动作的任务。
  2. 根据权利要求1所述语音交互方法,其特征在于,所述对语音请求进行语义理解的结果进行融合处理包括:
    对所述语音请求进行全局语义理解以得到第一理解结果;
    对所述语音请求进行场景语义理解以得到第二理解结果。
  3. 根据权利要求2所述的语音交互方法,其特征在于,所述对所述语音请求进行全局语义理解以得到第一理解结果包括:
    根据预存的模板对所述语音请求信息进行自然语言理解处理以得到第一理解结果。
  4. 根据权利要求3所述的语音交互方法,其特征在于,所述根据预存的模板对所述语音请求信息进行自然语言理解处理以得到第一理解结果包括:
    基于预定模板对所述语音请求进行自然语言理解以生成第一理解子结果;
    基于预定的分类模型对所述语音请求进行自然语言理解以生成第二理解子结果;
    将所述第一理解子结果和所述第二理解子结果进行融合以得到所述第一理解结果。
  5. 根据权利要求2所述的语音交互方法,其特征在于,所述对所述语音请求进行场景语义理解以得到第二理解结果包括:
    结合图形用户界面对所述语音请求信息进行自然语言理解处理以生成第二理解结果。
  6. 根据权利要求5所述的语音交互方法,其特征在于,所述结合图形用户界面对所述语音请求信息进行自然语言理解处理以生成第二理解结果包括:
    基于图形用户界面对所述语音请求进行自然语言理解以生成第三理解子结果;
    基于知识推理对所述语音请求进行自然语言理解以生成第四理解子结果;
    将所述第三理解子结果和所述第四理解子结果进行融合以得到所述第二理解结果。
  7. 根据权利要求1所述的语音交互方法,其特征在于,所述对所述融合处理的结果进行校验以对对话状态信息进行更新从而确定对应所述语音请求的任务动作包括:
    对所述融合处理的结果进行可执行校验;
    根据所述可执行校验的结果对所述对话状态信息进行更新。
  8. 根据权利要求7所述的语音交互方法,其特征在于,所述根据所述可执行校验的结果对所述对话状态信息进行更新包括:
    根据所述校验的结果更新在所述对话状态信息的任务树中命中的节点;
    根据所述节点的数量确定所述对应所述语音请求的任务的数量;
    根据所述任务数量的确定对应所述语音请求的任务动作。
  9. 根据权利要求1所述的语音交互方法,其特征在于,所述根据预定规则进行决策以确定对应所述任务动作的任务包括:
    将用于进行所述融合处理的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息;
    根据预定的策略分别匹配所述第一层语音请求信息和所述第二层语音请求信息的策略并得到对应的优先级;
    对所述第一层语音请求信息的优先级和所述第二层语音请求信息的优先级进行融合处理以确定所述任务。
  10. 根据权利要求9所述的语音交互方法,其特征在于,所述将用于进行语义理解的输入信息进行分层处理以得到第一层语音请求信息和第二层语音请求信息包括:
    提取所述输入信息中的特征信息;
    以所述对话状态信息为要素将所述特征信息划分为所述第一层语音请求信息;
    以语义理解为要素将所述特征信息划分为所述第二层语音请求信息。
  11. 根据权利要求10所述的语音交互方法,其特征在于,所述根据预定的策略分别匹配所述第一层语音请求信息和所述第二层语音请求信息的策略并得到对应的优先级包括:
    根据所述对话状态信息,自所述预定的策略匹配所述第一层语音请求信息的第一任务并获取所述第一任务的优先级;
    根据所述语义理解的结果信息,自所述预定的策略匹配所述第二层语音请求信息的第二任务并获取所述第二任务的优先级。
  12. 根据权利要求10所述的语音交互方法,其特征在于,所述对所述第一层语音请求信息的优先级和所述第二层语音请求信息的优先级进行融合处理以确定所述任务包括:
    根据所述预定的策略,比较所述第一任务的优先级和所述第二任务的优先级对应的分值;
    根据比较结果将所述分值更高的任务确定为所述任务动作的任务。
  13. 一种服务器,其特征在于,包括:
    语义融合模块,用于对语音请求进行语义理解的结果进行融合处理;
    动作确定模块,用于根据所述语义理解的结果确定对应所述语音请求的任务动作;
    任务确定模块,用于根据预定规则进行决策以确定对应所述任务动作的任务。
  14. 一种服务器,其特征在于,包括存储器和处理器,所述存储器中存储有计算机程序,所述计算机程序被所述处理器执行时,实现权利要求1-12任一项所述的语音交互方法。
  15. 一种计算机程序的非易失性计算机可读存储介质,其特征在于,当所述计算机程序被一个或多个处理器执行时,实现权利要求1-12中任一项所述的语音交互方法。
PCT/CN2020/140940 2020-09-18 2020-12-29 语音交互方法、服务器和计算机可读存储介质 WO2022057152A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010986214.6A CN112164400A (zh) 2020-09-18 2020-09-18 语音交互方法、服务器和计算机可读存储介质
CN202010986214.6 2020-09-18

Publications (1)

Publication Number Publication Date
WO2022057152A1 true WO2022057152A1 (zh) 2022-03-24

Family

ID=73858281

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/140940 WO2022057152A1 (zh) 2020-09-18 2020-12-29 语音交互方法、服务器和计算机可读存储介质

Country Status (2)

Country Link
CN (1) CN112164400A (zh)
WO (1) WO2022057152A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394300A (zh) * 2022-10-28 2022-11-25 广州小鹏汽车科技有限公司 语音交互方法、语音交互装置、车辆和可读存储介质
CN116016578A (zh) * 2022-11-22 2023-04-25 中国第一汽车股份有限公司 一种基于设备状态和用户行为的智能语音引导方法
CN116092494A (zh) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 语音交互方法、服务器和计算机可读存储介质
CN116564316A (zh) * 2023-07-11 2023-08-08 北京边锋信息技术有限公司 一种语音人机交互方法、装置

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113053394B (zh) * 2021-04-27 2024-01-09 广州小鹏汽车科技有限公司 语音处理方法、服务器、语音处理系统和存储介质
CN113421561B (zh) * 2021-06-03 2024-01-09 广州小鹏汽车科技有限公司 语音控制方法、语音控制装置、服务器和存储介质
CN113658585B (zh) * 2021-08-13 2024-04-09 北京百度网讯科技有限公司 语音交互模型的训练方法、语音交互方法及装置
CN116168704B (zh) * 2023-04-26 2023-07-18 长城汽车股份有限公司 语音交互的引导方法、装置、设备、介质及车辆

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110299136A (zh) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 一种用于语音识别的处理方法及其系统
CN110574105A (zh) * 2018-03-07 2019-12-13 谷歌有限责任公司 用于基于语音发起定制装置动作的系统和方法
CN110807333A (zh) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 一种语义理解模型的语义处理方法、装置及存储介质
CN110875039A (zh) * 2018-08-30 2020-03-10 阿里巴巴集团控股有限公司 语音识别方法和设备
CN111081220A (zh) * 2019-12-10 2020-04-28 广州小鹏汽车科技有限公司 车载语音交互方法、全双工对话系统、服务器和存储介质
CN111145742A (zh) * 2019-12-18 2020-05-12 中国人民武装警察部队警官学院 一种基于语音指令的预案指挥执行方法和系统

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9436759B2 (en) * 2007-12-27 2016-09-06 Nant Holdings Ip, Llc Robust information extraction from utterances
CN103020083B (zh) * 2011-09-23 2016-06-15 北京百度网讯科技有限公司 需求识别模板的自动挖掘方法、需求识别方法及对应装置
US8843470B2 (en) * 2012-10-05 2014-09-23 Microsoft Corporation Meta classifier for query intent classification
CN104360897B (zh) * 2014-10-29 2017-09-22 百度在线网络技术(北京)有限公司 对话处理方法和对话管理系统
CN108701459A (zh) * 2015-12-01 2018-10-23 纽昂斯通讯公司 将来自各种语音服务的结果表示为统一概念知识库
CN105808525B (zh) * 2016-03-29 2018-06-29 国家计算机网络与信息安全管理中心 一种基于相似概念对的领域概念上下位关系抽取方法
CN107316643B (zh) * 2017-07-04 2021-08-17 科大讯飞股份有限公司 语音交互方法及装置
CN108874917B (zh) * 2018-05-30 2021-11-23 北京五八信息技术有限公司 意图识别方法、装置、设备及存储介质
CN111292751B (zh) * 2018-11-21 2023-02-28 北京嘀嘀无限科技发展有限公司 语义解析方法及装置、语音交互方法及装置、电子设备
CN111508482A (zh) * 2019-01-11 2020-08-07 阿里巴巴集团控股有限公司 语义理解及语音交互方法、装置、设备及存储介质
CN109754804A (zh) * 2019-02-21 2019-05-14 珠海格力电器股份有限公司 一种语音控制方法、装置、存储介质及智能家居系统
CN110109541B (zh) * 2019-04-25 2022-04-05 广州智伴人工智能科技有限公司 一种多模态交互的方法
CN110377908B (zh) * 2019-07-19 2023-05-30 科大讯飞股份有限公司 语义理解方法、装置、设备及可读存储介质
CN110728981A (zh) * 2019-10-09 2020-01-24 北京达佳互联信息技术有限公司 一种交互功能的执行方法、装置、电子设备及存储介质
CN111008532B (zh) * 2019-12-12 2023-09-12 广州小鹏汽车科技有限公司 语音交互方法、车辆和计算机可读存储介质
CN111429903B (zh) * 2020-03-19 2021-02-05 百度在线网络技术(北京)有限公司 音频信号识别方法、装置、系统、设备和可读介质
CN111767021A (zh) * 2020-06-28 2020-10-13 广州小鹏车联网科技有限公司 语音交互方法、车辆、服务器、系统和存储介质
CN112164402B (zh) * 2020-09-18 2022-07-12 广州小鹏汽车科技有限公司 车辆语音交互方法、装置、服务器和计算机可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110574105A (zh) * 2018-03-07 2019-12-13 谷歌有限责任公司 用于基于语音发起定制装置动作的系统和方法
CN110299136A (zh) * 2018-03-22 2019-10-01 上海擎感智能科技有限公司 一种用于语音识别的处理方法及其系统
CN110875039A (zh) * 2018-08-30 2020-03-10 阿里巴巴集团控股有限公司 语音识别方法和设备
CN110807333A (zh) * 2019-10-30 2020-02-18 腾讯科技(深圳)有限公司 一种语义理解模型的语义处理方法、装置及存储介质
CN111081220A (zh) * 2019-12-10 2020-04-28 广州小鹏汽车科技有限公司 车载语音交互方法、全双工对话系统、服务器和存储介质
CN111145742A (zh) * 2019-12-18 2020-05-12 中国人民武装警察部队警官学院 一种基于语音指令的预案指挥执行方法和系统

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115394300A (zh) * 2022-10-28 2022-11-25 广州小鹏汽车科技有限公司 语音交互方法、语音交互装置、车辆和可读存储介质
CN116016578A (zh) * 2022-11-22 2023-04-25 中国第一汽车股份有限公司 一种基于设备状态和用户行为的智能语音引导方法
CN116016578B (zh) * 2022-11-22 2024-04-16 中国第一汽车股份有限公司 一种基于设备状态和用户行为的智能语音引导方法
CN116092494A (zh) * 2023-04-07 2023-05-09 广州小鹏汽车科技有限公司 语音交互方法、服务器和计算机可读存储介质
CN116092494B (zh) * 2023-04-07 2023-08-25 广州小鹏汽车科技有限公司 语音交互方法、服务器和计算机可读存储介质
CN116564316A (zh) * 2023-07-11 2023-08-08 北京边锋信息技术有限公司 一种语音人机交互方法、装置
CN116564316B (zh) * 2023-07-11 2023-11-03 北京边锋信息技术有限公司 一种语音人机交互方法、装置

Also Published As

Publication number Publication date
CN112164400A (zh) 2021-01-01

Similar Documents

Publication Publication Date Title
WO2022057152A1 (zh) 语音交互方法、服务器和计算机可读存储介质
Uc-Cetina et al. Survey on reinforcement learning for language processing
JP6562982B2 (ja) 対話システム、対話方法、および対話システムを適合させる方法
JP6448723B2 (ja) 対話システム、対話方法、および対話システムを適合させる方法
US7627466B2 (en) Natural language interface for driving adaptive scenarios
CN112164401B (zh) 语音交互方法、服务器和计算机可读存储介质
Martin TYCOON: Theoretical framework and software tools for multimodal interfaces
CN106202270B (zh) 基于自然语言的人机对话方法及装置
US11562744B1 (en) Stylizing text-to-speech (TTS) voice response for assistant systems
CN112102832B (zh) 语音识别方法、装置、服务器和计算机可读存储介质
US11790897B2 (en) Response generation for conversational computing interface
CN109299245B (zh) 知识点召回的方法和装置
CN111008532A (zh) 语音交互方法、车辆和计算机可读存储介质
CN108780444B (zh) 可扩展设备和依赖于域的自然语言理解
JP7213943B2 (ja) 車載機器の音声処理方法、装置、機器及び記憶媒体
US20210380118A1 (en) Method and apparatus for regulating user emotion, device, and readable storage medium
CN113468302A (zh) 组合共享询问线的多个搜索查询的参数
US11257482B2 (en) Electronic device and control method
CN113421561B (zh) 语音控制方法、语音控制装置、服务器和存储介质
US10762902B2 (en) Method and apparatus for synthesizing adaptive data visualizations
US8498991B2 (en) Neighborhood guide for semantic search system and method to support local POI discovery
CN115457959A (zh) 语音交互方法、服务器及计算机可读存储介质
Wang et al. Significance of phonological features in speech emotion recognition
CN114822533A (zh) 语音交互方法、模型训练方法、电子设备和存储介质
Papangelis et al. Ld-sds: Towards an expressive spoken dialogue system based on linked-data

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20954005

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20954005

Country of ref document: EP

Kind code of ref document: A1