WO2022252946A1 - Voice control method, voice control device, server, and storage medium - Google Patents

Publication number
WO2022252946A1
WO2022252946A1 PCT/CN2022/092246 CN2022092246W WO2022252946A1 WO 2022252946 A1 WO2022252946 A1 WO 2022252946A1 CN 2022092246 W CN2022092246 W CN 2022092246W WO 2022252946 A1 WO2022252946 A1 WO 2022252946A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
semantic understanding
scene
semantic
information
Prior art date
Application number
PCT/CN2022/092246
Other languages
French (fr)
Chinese (zh)
Inventor
赵耀
易晖
翁志伟
Original Assignee
广州小鹏汽车科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州小鹏汽车科技有限公司
Publication of WO2022252946A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/08 Speech classification or search
    • G10L15/18 Speech classification or search using natural language modelling
    • G10L15/1822 Parsing for meaning understanding
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Definitions

  • when the voice control device handles a complex task, it asks the user for as many task details as possible, and it can learn the user's specific intent only after multiple rounds of voice dialogue with the user.
  • this kind of multi-round voice dialogue must fuse the single-round semantic understanding results with the multi-round information to complete the description of the user's intent.
  • receiving the current-round voice instruction, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene includes: upon receiving the current-round voice instruction, determining a semantic space according to the voice dialogue information of historical rounds, the semantic space being used to characterize the direction of semantic understanding of the current-round voice instruction; and determining the dynamic scene according to the semantic space and the graphical user interface information.
  • a current-round voice instruction is received; the current-round voice instruction may be the user's instruction "Confirm turning it off".
  • the voice dialogue information of historical rounds includes: the user's instruction "Turn off the low-speed simulated sound" and the system's confirmation "The low-speed simulated sound alerts pedestrians and reduces safety risks. Are you sure you want to turn it off?"
  • the semantic space is determined from the voice dialogue information of historical rounds, and the semantic space is used to characterize the direction of semantic understanding of the current-round voice instruction.
  • A semantic space can be understood as a certain semantic range. A semantic space can include a static semantic space and a dynamic semantic space.
  • the similarity between the document data of the scene semantic document and the dynamic scene elements is greater than a similarity threshold.
  • there are two kinds of semantic understanding results: one uses the semantic understanding corresponding to the current-round voice instruction as the semantic understanding result, and the other uses the global semantic understanding as the semantic understanding result.
  • step 05 includes the steps of:
  • the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, wherein both step 051 and step 052 can be implemented by the processor 300; that is, the processor 300 can be used to:
  • the voice dialogue information of historical rounds can be updated, and the updating process can be carried out by the dialogue state information update module and the dialogue policy optimization module.
  • the dialogue state information update module and the dialogue policy optimization module can be combined into a single dialogue management module. Updating the dialogue information includes updating the dialogue action information and the dialogue state information. After updating and optimization, reply information (an operation instruction) can be generated and sent so that the vehicle 1000 performs the corresponding operation.
  • the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, wherein step 0513, step 0514, and step 0515 can all be implemented by the processor 300; that is, the processor 300 can be used to: determine the priority order of multiple scene pages in a dynamic scene; push the higher-priority scene page nodes onto the lower-priority scene page stack according to the priority order of the multiple scene pages; and control the vehicle 1000 to perform the operation corresponding to the higher-priority scene page.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A voice control method, a voice control device (100), a server (500), and a storage medium (800). The voice control method comprises: receiving a voice instruction of a current round, receiving graphical user interface information, and fusing the graphical user interface information and voice dialogue information of a historical round to generate a dynamic scene (01); generating a scene semantic document according to the dynamic scene (02); determining, according to the scene semantic document, a semantic understanding corresponding to the voice instruction of the current round (03); determining a semantic understanding result according to the semantic understanding corresponding to the voice instruction of the current round or a global semantic understanding (04); and controlling, according to the semantic understanding result, a vehicle (1000) to perform a corresponding operation (05).

Description

Voice control method, voice control device, server, and storage medium
Priority information
This application claims priority to and the benefit of Chinese patent application No. 202110619459.X, filed with the China National Intellectual Property Administration on June 3, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of speech recognition, and in particular to a voice control method, a voice control device, a server, and a storage medium.
Background
In the related art, when a voice control device handles a complex task, it asks the user for as many task details as possible, and it can learn the user's specific intent only after multiple rounds of voice dialogue with the user. Such multi-round voice dialogue must fuse the single-round semantic understanding results with the multi-round information to complete the description of the user's intent. However, in scenarios involving multiple vertical domains (such as multi-round voice dialogue), the voice control device is difficult to extend to each vertical domain. Moreover, as the number of vertical domains grows, the accuracy of semantic understanding declines, which ultimately leads to a poor user experience.
Summary
Embodiments of the present application provide a voice control method, a voice control device, a server, and a storage medium.
The voice control method of the embodiments of the present application includes: receiving a current-round voice instruction, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene; generating a scene semantic document according to the dynamic scene; determining, according to the scene semantic document, a semantic understanding corresponding to the current-round voice instruction; determining a semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or a global semantic understanding; and controlling a vehicle to perform a corresponding operation according to the semantic understanding result.
In some embodiments, receiving the current-round voice instruction, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene includes: upon receiving the current-round voice instruction, determining a semantic space according to the voice dialogue information of historical rounds, the semantic space being used to characterize the direction of semantic understanding of the current-round voice instruction; and determining the dynamic scene according to the semantic space and the graphical user interface information.
In some embodiments, receiving the current-round voice instruction, receiving the graphical user interface information, and fusing the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene includes: upon receiving the current-round voice instruction, loading and parsing dynamic scene elements included in the voice dialogue information of historical rounds; and generating the dynamic scene according to the dynamic scene elements and the voice dialogue information of historical rounds.
In some embodiments, the similarity between the document data of the scene semantic document and the dynamic scene elements is greater than a similarity threshold.
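The threshold condition above can be illustrated with a minimal sketch. The text does not say which similarity measure is used, so the Jaccard token overlap, the function names `jaccard` and `build_scene_document`, and the threshold value 0.3 below are all assumptions chosen only for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the whitespace-token sets of two strings.

    Assumption: the patent does not name a similarity measure; Jaccard is
    used here only because it is simple to verify by hand.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_scene_document(candidates, scene_elements, threshold=0.3):
    """Keep candidate entries whose best similarity to any dynamic scene
    element is strictly greater than the (hypothetical) threshold."""
    return [
        c for c in candidates
        if max((jaccard(c, e) for e in scene_elements), default=0.0) > threshold
    ]


candidates = [
    "turn off low-speed simulated sound",
    "confirm turn off",
    "play some music",
]
scene_elements = ["low-speed simulated sound switch", "confirm turn off dialog"]
document = build_scene_document(candidates, scene_elements)
```

With these inputs, the first two candidates overlap with a scene element and are kept, while the unrelated music request is filtered out of the scene semantic document.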
In some embodiments, determining the semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or the global semantic understanding includes: searching a database using the semantic understanding corresponding to the current-round voice instruction; when the search results contain a result matching the semantic understanding corresponding to the current-round voice instruction, taking that semantic understanding as the semantic understanding result; and when the search results contain no such match, taking the global semantic understanding as the semantic understanding result.
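The selection rule above amounts to a scoped lookup with a global fallback. The following sketch assumes the database can be modelled as a set of known intents; the `Understanding` type, its fields, and the function name `resolve` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Understanding:
    intent: str  # hypothetical intent label
    source: str  # "scene" (current-round) or "global"


def resolve(
    scene_understanding: Optional[Understanding],
    database: Set[str],
    global_understanding: Understanding,
) -> Understanding:
    """Return the current-round understanding if the database search matches
    it; otherwise fall back to the global understanding."""
    if scene_understanding is not None and scene_understanding.intent in database:
        return scene_understanding
    return global_understanding


database = {"confirm_close_low_speed_sound"}
fallback = Understanding("open_navigation", "global")

matched = resolve(
    Understanding("confirm_close_low_speed_sound", "scene"), database, fallback
)
unmatched = resolve(Understanding("unknown_intent", "scene"), database, fallback)
```

Here `matched` keeps the scene-scoped understanding, while `unmatched` falls back to the global one.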
In some embodiments, controlling the vehicle to perform the corresponding operation according to the semantic understanding result includes: when the semantic understanding corresponding to the current-round voice instruction is taken as the semantic understanding result, updating the voice dialogue information of historical rounds and sending an operation instruction so that the vehicle performs the corresponding operation; and when the global semantic understanding is taken as the semantic understanding result, controlling the vehicle to initiate a new round of dialogue tasks.
In some embodiments, updating the voice dialogue information of historical rounds includes: querying, according to the voice dialogue information of historical rounds, the dialogue action information output by the user and the dialogue action information output by the system, to obtain user slot parameters and system slot parameters; and performing slot actions using the user slot parameters and the system slot parameters to update trusted slot parameters, thereby updating the dialogue state information.
In some embodiments, the slot actions include at least one of a continue action, a delete action, an update action, and an invalidate action.
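A minimal sketch of how the four slot actions might update the trusted slot parameters follows. This passage does not define the exact semantics of each action, so the interpretations in the comments (continue keeps a slot, delete removes it, update overwrites it, invalidate keeps the slot but clears its value) are assumptions, as are the names `SlotAction` and `apply_slot_actions`.

```python
from enum import Enum


class SlotAction(Enum):
    CONTINUE = "continue"      # assumed: carry the slot over unchanged
    DELETE = "delete"          # assumed: remove the slot entirely
    UPDATE = "update"          # assumed: overwrite the slot with a new value
    INVALIDATE = "invalidate"  # assumed: keep the slot but clear its value


def apply_slot_actions(trusted, actions):
    """Return a new trusted-slot dict after applying (action, slot, value)
    triples in order. The input dict is not modified."""
    state = dict(trusted)
    for action, slot, value in actions:
        if action is SlotAction.CONTINUE:
            continue
        if action is SlotAction.DELETE:
            state.pop(slot, None)
        elif action is SlotAction.UPDATE:
            state[slot] = value
        elif action is SlotAction.INVALIDATE and slot in state:
            state[slot] = None
    return state


trusted = {"device": "low_speed_sound", "state": "on", "volume": "3"}
new_state = apply_slot_actions(
    trusted,
    [
        (SlotAction.CONTINUE, "device", None),
        (SlotAction.UPDATE, "state", "off"),
        (SlotAction.DELETE, "volume", None),
    ],
)
```

Returning a fresh dictionary rather than mutating the input keeps the previous dialogue state available for comparison, which is one reasonable design choice among several.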
In some embodiments, updating the voice dialogue information of historical rounds includes: determining a priority order of multiple scene pages in the dynamic scene; pushing the higher-priority scene page nodes onto the lower-priority scene page stack according to the priority order of the multiple scene pages; and controlling the vehicle to perform the operation corresponding to the higher-priority scene page.
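One possible reading of the page-stack step is sketched below: pages are pushed in ascending priority so that the highest-priority page ends up on top of the stack and is the one acted upon. This interpretation, the numeric priorities, and the names `build_page_stack` and `top_page` are assumptions; the passage above does not spell out the stack discipline.

```python
def build_page_stack(pages):
    """Build a stack (Python list, top at the end) from (name, priority)
    pairs. Assumption: a larger number means a higher priority."""
    stack = []
    for page in sorted(pages, key=lambda p: p[1]):
        stack.append(page)  # higher-priority pages land above lower ones
    return stack


def top_page(stack):
    """Return the name of the page whose operation would be executed,
    or None if the stack is empty."""
    return stack[-1][0] if stack else None


pages = [("confirm_dialog", 2), ("settings", 1), ("home", 0)]
stack = build_page_stack(pages)
active = top_page(stack)
```

In this sketch the confirmation dialog sits on top of the settings and home pages, so an instruction such as a confirmation would be resolved against it first.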
The voice control device of the embodiments of the present application includes a first generation module, a second generation module, a first determination module, a second determination module, and a control module. The first generation module is configured to receive a current-round voice instruction, receive graphical user interface information, and fuse the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene; the second generation module is configured to generate a scene semantic document according to the dynamic scene; the first determination module is configured to determine, according to the scene semantic document, the semantic understanding corresponding to the current-round voice instruction; the second determination module is configured to determine a semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or a global semantic understanding; and the control module is configured to control a vehicle to perform a corresponding operation according to the semantic understanding result.
The server of the embodiments of the present application includes a memory and a processor. The memory stores a computer program which, when executed by the processor, implements the voice control method of any of the above embodiments.
The non-volatile computer-readable storage medium of the embodiments of the present application stores a computer program which, when executed by one or more processors, implements the voice control method of any of the above embodiments.
The voice control method, voice control device, server, and storage medium of the embodiments of the present application can fuse graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene and generate a scene semantic document according to the dynamic scene. The scene semantic document constrains the semantic understanding process within a task and allows multi-round dialogue state to be managed, thereby improving the accuracy of multi-round dialogue semantic understanding in that vertical domain.
Brief description of the drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a voice control method according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a voice control device according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of a server according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a vehicle according to an embodiment of the present application;
FIG. 5 is a schematic interaction diagram of a voice control method according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a voice control method according to an embodiment of the present application;
FIG. 7 to FIG. 9 are schematic scene diagrams of a voice control method according to an embodiment of the present application;
FIG. 10 is a schematic flowchart of a voice control method according to an embodiment of the present application;
FIG. 11 and FIG. 12 are schematic scene diagrams of a voice control method according to an embodiment of the present application;
FIG. 13 to FIG. 15 are schematic flowcharts of a voice control method according to an embodiment of the present application;
FIG. 16 is a schematic scene diagram of a voice control method according to an embodiment of the present application;
FIG. 17 is a schematic flowchart of a voice control method according to an embodiment of the present application;
FIG. 18 is a schematic scene diagram of a voice control method according to an embodiment of the present application;
FIG. 19 is a schematic diagram of the connection between a processor and a computer-readable storage medium according to an embodiment of the present application.
Detailed description
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the present application and should not be construed as limiting it.
In the description of the embodiments of the present application, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of technical features referred to. Thus, a feature qualified by "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present application, "multiple" means two or more, unless otherwise expressly and specifically defined.
Referring to FIG. 1, the present application provides a voice control method, which includes:
Step 01: receiving a current-round voice instruction, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene;
Step 02: generating a scene semantic document according to the dynamic scene;
Step 03: determining, according to the scene semantic document, the semantic understanding corresponding to the current-round voice instruction;
Step 04: determining a semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or a global semantic understanding;
Step 05: controlling the vehicle 1000 to perform a corresponding operation according to the semantic understanding result.
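The data flow through steps 01 to 05 can be sketched as a single function that chains five stages. The function name `run_round`, its keyword parameters, and the stub stages used in the usage example are all hypothetical; they show only the order in which the outputs feed one another, not how any stage is actually implemented.

```python
def run_round(instruction, gui_info, history, *, build_scene, build_document,
              understand, decide, execute):
    """Chain the five steps for one dialogue round.

    Each stage is passed in as a callable so that the sketch stays
    independent of any concrete implementation.
    """
    scene = build_scene(instruction, gui_info, history)  # step 01
    document = build_document(scene)                      # step 02
    understanding = understand(instruction, document)     # step 03
    result = decide(understanding)                        # step 04
    return execute(result)                                # step 05


# Usage with trivial stand-in stages (assumptions, for illustration only).
outcome = run_round(
    "confirm turn off",
    {"page": "sound_settings"},
    ["turn off the low-speed simulated sound"],
    build_scene=lambda i, g, h: {"gui": g, "history": h},
    build_document=lambda s: {"confirm turn off": "close_low_speed_sound"},
    understand=lambda i, d: d.get(i),
    decide=lambda u: u if u is not None else "global_fallback",
    execute=lambda r: f"vehicle executes: {r}",
)
```

Passing the stages as parameters mirrors the module split described for the device (two generation modules, two determination modules, one control module) without committing to their internals.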
Referring to FIG. 2, the voice control device 100 of the embodiments of the present application includes a first generation module 10, a second generation module 20, a first determination module 30, a second determination module 40, and a control module 50. The voice control method of the present application can be implemented by the voice control device 100: step 01 can be implemented by the first generation module 10, step 02 by the second generation module 20, step 03 by the first determination module 30, step 04 by the second determination module 40, and step 05 by the control module 50. That is, the first generation module 10 is configured to receive the current-round voice instruction, receive the graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene. The second generation module 20 is configured to generate the scene semantic document according to the dynamic scene. The first determination module 30 is configured to determine, according to the scene semantic document, the semantic understanding corresponding to the current-round voice instruction. The second determination module 40 is configured to determine the semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or the global semantic understanding. The control module 50 is configured to control the vehicle 1000 to perform the corresponding operation according to the semantic understanding result.
Referring to FIG. 3 and FIG. 4 together, the server 500 of the embodiments of the present application includes a memory 200 and a processor 300. The server 500 is used to control the vehicle 1000. The voice control method of the embodiments of the present application can be implemented by the server 500. The server 500 may include a system side; the memory 200 stores a computer program which, when executed by the processor 300, implements the above voice control method. Step 01, step 02, step 03, step 04, and step 05 can all be implemented by the processor 300. That is, the processor 300 can be used to: receive the current-round voice instruction, receive the graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene; generate the scene semantic document according to the dynamic scene; determine, according to the scene semantic document, the semantic understanding corresponding to the current-round voice instruction; determine the semantic understanding result according to the semantic understanding corresponding to the current-round voice instruction or the global semantic understanding; and control the vehicle 1000 to perform the corresponding operation according to the semantic understanding result.
The processor 300 may include a driver board. The driver board may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
Specifically, the voice dialogue information of historical rounds includes the historical dialogue between the user and the system, the current-round voice instruction may be an action of the user, and the graphical user interface (GUI) information includes the content that the in-vehicle system running on the vehicle 1000 presents to the user through its graphical user interface.
In one embodiment, a current-round voice instruction is received; the current-round voice instruction may be the user's instruction "Confirm turning it off". Upon receiving this voice instruction, the graphical user interface information and the voice dialogue information of historical rounds are received as well. The voice dialogue information of historical rounds includes two utterances: the user's instruction "Turn off the low-speed simulated sound" and the system's confirmation "The low-speed simulated sound alerts pedestrians and reduces safety risks. Are you sure you want to turn it off?" The graphical user interface information and the voice dialogue information of historical rounds are fused to generate the dynamic scene.
It is worth noting that, in some embodiments, the voice control device 100 may be used to control the vehicle 1000, which includes a display area, an electro-acoustic element, a communication element, a processor, and the like. The display area of the vehicle 1000 may include an instrument screen, a large in-vehicle screen, and a head-up display implemented on the windshield of the vehicle 1000. In the present application, the large in-vehicle screen is used as the example display area for explanation, without limitation. Specifically, referring to FIG. 5, the vehicle 1000 includes a dynamic scene generator and the large in-vehicle screen. The large in-vehicle screen can receive user requests and can present system-generated replies to the user. Replies may be presented visually or by voice, without limitation. The large in-vehicle screen can submit the received user request to natural language understanding and, at the same time, pass the graphical user interface information on the screen to the dynamic scene generator, which can combine the graphical user interface information with the voice dialogue information of historical rounds to generate the dynamic scene.
A scene semantic document can be generated from the dynamic scene. The scene semantic document can be understood as a searchable space that contains multiple semantic understandings, so the semantic understanding corresponding to the current-round voice instruction can be looked up in it. It is worth noting that, in some embodiments, when the semantic understanding corresponding to the current-round voice instruction cannot be found in the scene semantic document generated from the dynamic scene, a global semantic understanding can be generated by combining global information. That is, either the semantic understanding corresponding to the current-round voice instruction or the global semantic understanding can yield the semantic understanding result. The difference is that the former is determined by searching the scene semantic document or by other means, whereas the global semantic understanding is used when nothing can be found in the scene semantic document. Because both are semantic understanding results, the vehicle 1000 can be controlled to perform the corresponding operation according to either.
Specifically, when the semantic understanding result is determined by the semantic understanding corresponding to the current-round voice instruction, the voice dialogue information of historical rounds can be updated, and the vehicle 1000 is then controlled to perform the corresponding operation, such as "open the window", "close navigation", or "open the music interface", without limitation. When the semantic understanding result is determined by the global semantic understanding, the voice dialogue information of historical rounds is not updated; upon receiving such a result, a new round of dialogue tasks can be initiated.
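The two branches described above can be sketched as a small dispatch function. The string labels "scene" and "global", the tuple-shaped return value, and the name `control_vehicle` are assumptions made for illustration; the only behaviour taken from the text is that the dialogue history is updated in the first branch and left untouched in the second.

```python
def control_vehicle(result_source, operation, history):
    """Return (new_history, action) for one semantic understanding result.

    result_source: "scene" if the result came from the current-round
    understanding, "global" if it came from the global understanding.
    """
    if result_source == "scene":
        # Update the historical dialogue information, then send the operation.
        return history + [operation], ("send_operation", operation)
    # Global result: history is not updated; a new dialogue task starts.
    return history, ("start_new_dialogue_task", operation)


history = ["turn_off_low_speed_sound?"]
scene_history, scene_action = control_vehicle("scene", "open_window", history)
global_history, global_action = control_vehicle("global", "open_navigation", history)
```

Because the scene branch builds a new list instead of appending in place, the caller's original history is preserved in both branches.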
The voice control method, the voice control device 100, and the server 500 of the embodiments of the present application can fuse the graphical user interface information with the voice dialogue information of the historical rounds to generate a dynamic scene, and generate a scene semantic document from the dynamic scene. The scene semantic document constrains the semantic understanding process within a task and allows the multi-round voice state to be managed, thereby improving the accuracy of semantic understanding for multi-round dialogue in this vertical domain.
Referring to FIG. 6, in some embodiments, step 01 includes the following steps:
Step 012: upon receiving the current-round voice command, determine a semantic space according to the voice dialogue information of the historical rounds, where the semantic space characterizes the semantic understanding direction of the current-round voice command;
Step 014: determine the dynamic scene according to the semantic space and the graphical user interface information.
In some embodiments, the voice control device 100 includes a third determining module, which includes a first determining subunit and a second determining subunit. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 012 can be implemented by the first determining subunit and step 014 by the second determining subunit. That is, the first determining subunit is configured to determine, upon receiving the current-round voice command, a semantic space according to the voice dialogue information of the historical rounds, where the semantic space characterizes the semantic understanding direction of the current-round voice command. The second determining subunit is configured to determine the dynamic scene according to the semantic space and the graphical user interface information.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 012 and step 014 can be implemented by the processor 300. That is, the processor 300 can be configured to: upon receiving the current-round voice command, determine a semantic space according to the voice dialogue information of the historical rounds, where the semantic space characterizes the semantic understanding direction of the current-round voice command; and determine the dynamic scene according to the semantic space and the graphical user interface information.
Specifically, the semantic space is determined from the voice dialogue information of the historical rounds and characterizes the semantic understanding direction of the current-round voice command. A semantic space can be understood as a certain semantic range. Semantic spaces include static semantic spaces and dynamic semantic spaces.
Referring to FIG. 7, FIG. 7 includes a dialogue system round (that is, a multi-round dialogue in one vertical domain), namely dialogue one. In the system round of FIG. 7, the system asks the user whether they understand the safety risk of the operation. If the user confirms, the corresponding operation is performed. The latent semantic space for the next round is shown in FIG. 7. Referring to FIG. 8, FIG. 8 also includes a dialogue system round (that is, a multi-round dialogue in one vertical domain), namely dialogue two. In the system round of FIG. 8, the system asks the user whether they understand the consequences of the operation. If the user confirms, the corresponding operation is performed. The latent semantic space for the next round is shown in FIG. 8. As can be seen, the two dialogues have the same latent dialogue action information in the confirmation round (the next round) but different latent semantic spaces. Moreover, the latent semantic space of each can already be determined at the system confirmation round (the next round), which makes it a static semantic space.
Referring to FIG. 9, FIG. 9 includes a dialogue system round (that is, a multi-round dialogue in one vertical domain), namely dialogue three. In the system round of FIG. 9, the system asks the user for a selection. The latent semantic space of the user reply round (the next round) cannot be determined at the system inquiry round; the latent semantic space is shown in FIG. 9, and it is therefore a dynamic semantic space.
A static semantic space can be understood as one in which the replies do not depend on factors such as time, scene, space, and user, whereas a dynamic semantic space has many variables. In one embodiment, a different current user location leads to different content in the semantic space. For example, the list of candidate routes produced when a user in Zhongguancun says "navigate to Peking University" differs from the list produced when a user in Shenzhen says the same thing; the dynamic semantic space thus changes with the region where the user is located.
In some embodiments, a dynamic scene can be understood as the semantic space converted into a readable tree structure that retains all the information in the semantic space. In this way, the dynamic scene can be determined according to the semantic space and the graphical user interface information.
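The tree-structured view of a dynamic scene described above can be illustrated with a minimal Python sketch. The names `SceneNode`, `build_dynamic_scene`, and `leaves`, as well as the two-branch layout (one branch for the semantic space, one for the GUI information), are assumptions made for illustration; the application does not specify a concrete data structure.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node of the readable tree that represents a dynamic scene."""
    label: str
    children: list[SceneNode] = field(default_factory=list)


def build_dynamic_scene(semantic_space: dict[str, list[str]],
                        gui_elements: list[str]) -> SceneNode:
    """Combine a semantic space and GUI information into one tree (hypothetical layout)."""
    space = SceneNode("semantic_space")
    for slot, candidates in semantic_space.items():
        # Each slot becomes a subtree whose leaves are its candidate values.
        space.children.append(SceneNode(slot, [SceneNode(c) for c in candidates]))
    gui = SceneNode("gui", [SceneNode(e) for e in gui_elements])
    return SceneNode("dynamic_scene", [space, gui])


def leaves(node: SceneNode) -> list[str]:
    """Collect leaf labels, which lets a caller check that no information was dropped."""
    if not node.children:
        return [node.label]
    return [leaf for child in node.children for leaf in leaves(child)]


scene = build_dynamic_scene(
    {"destination": ["Starbucks Zhongguancun Store", "Starbucks at Peking University North Gate"]},
    ["exit", "re-navigate"],
)
```

Calling `leaves(scene)` returns every candidate value and GUI element, which is one simple way to confirm that the conversion retains the information of the semantic space.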
Referring to FIG. 10, in some embodiments, step 01 includes the following steps:
Step 016: upon receiving the current-round voice command, load and parse the dynamic scene elements included in the voice dialogue information of the historical rounds;
Step 018: generate the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.
In some embodiments, the voice control device 100 includes a first processing module and a second generating module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 016 can be implemented by the first processing module and step 018 by the second generating module. That is, the first processing module is configured to load and parse, upon receiving the current-round voice command, the dynamic scene elements included in the voice dialogue information of the historical rounds. The second generating module is configured to generate the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 016 and step 018 can be implemented by the processor 300. That is, the processor 300 can be configured to: upon receiving the current-round voice command, load and parse the dynamic scene elements included in the voice dialogue information of the historical rounds; and generate the dynamic scene according to the dynamic scene elements and the voice dialogue information of the historical rounds.
Specifically, dynamic scene elements have different presentation types, such as buttons, sliders, state buttons, text input boxes, checkboxes, radio buttons, group buttons, switch buttons, views, groups, dialog boxes, and other interactive controls. In some embodiments, labels can also be obtained; a label includes dialogue action information and/or slot parameters. In this way, the dynamic scene elements included in the voice dialogue information of the historical rounds can be loaded and parsed, and the dynamic scene can be generated according to the dynamic scene elements and the voice dialogue information of the historical rounds.
In some embodiments, the similarity between the document data of the scene semantic document and the dynamic scene elements is greater than a similarity threshold.
Specifically, the scene semantic document includes multiple pieces of document data, each of which has a similarity to the dynamic scene elements greater than the similarity threshold. Whether a given piece of document data belongs to the scene semantic document can therefore be determined by comparing its similarity to the dynamic scene elements against the similarity threshold. If the similarity is less than the similarity threshold, the document data is considered not to belong to the scene semantic document; if the similarity is greater than or equal to the similarity threshold, the document data is considered to belong to the scene semantic document. Notably, the document data of the scene semantic document can also be determined in other ways, such as template matching, sentence similarity calculation, or model-based reading comprehension, which are not limited here.
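The threshold test above can be sketched as follows. The application does not name a similarity measure, so this sketch substitutes the character-level ratio from Python's standard `difflib.SequenceMatcher` as a stand-in; the function names and the default threshold of 0.6 are likewise assumptions chosen only for illustration.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; a real system might use sentence embeddings instead."""
    return SequenceMatcher(None, a, b).ratio()


def select_document_data(candidates: list[str],
                         scene_elements: list[str],
                         threshold: float = 0.6) -> list[str]:
    """Keep candidates whose best similarity to any scene element reaches the threshold."""
    kept = []
    for text in candidates:
        best = max((similarity(text, e) for e in scene_elements), default=0.0)
        # Greater than or equal to the threshold: the data belongs to the document.
        if best >= threshold:
            kept.append(text)
    return kept


kept = select_document_data(["close window", "play music"], ["close the window"])
```

With these inputs only "close window" is retained, because "play music" shares almost no character sequence with the scene element.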
In one embodiment, generating the dynamic scene includes: loading the dialogue state information of the voice dialogue information of the historical rounds, including slot parameters, system dialogue action information, and similar information; then inferring the potential user dialogue action information from the system dialogue action information; and finally applying synonym generalization to generalize the slot parameters, the labels of the dialogue action information, and so on.
Referring to FIG. 11, FIG. 11 includes user actions and system actions. User actions include inform, cancel, confirm, deny, and ask for more; system actions include ask, select, confirm, guide, deny, success, and failure. The list in FIG. 11 can be updated according to the dialogue state information. FIG. 11 contains multiple 1s and 0s, where 1 indicates that the dialogue between the system and the user is associated. For example, suppose the previous round is an inquiry action by the system and the next round is a reply action by the user: if the system's inquiry action and the user's reply action are associated, the entry is 1; if they are not associated, the entry is 0. In one example, the system asks the user "Close the window?" and the user answers "Yes, close the window". The two utterances can be judged to be associated and to have a contextual relationship, so the entry recorded in the table is 1. In another example, the system asks the user "Close the window?" and the user answers "The weather is so nice". The two utterances can be judged to be unrelated and to have no contextual relationship, so the entry recorded in the table is 0. When the entry is 0, the system can regard the user's reply as an erroneous reply and either treat it as noise or ask the user again, for example: "What did you say? Let me describe the question again: close the window?" In this way, the continuity of the dialogue can be verified, which facilitates generating the dynamic scene.
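The 1/0 association table can be sketched as a lookup keyed by a (system action, user action) pair. The entries below are hypothetical, since the actual contents of FIG. 11 are not reproduced in the text; the function names and the return strings are also assumptions.

```python
from __future__ import annotations

# Hypothetical entries: (system action, user action) -> 1 if the user action is a
# coherent reply to the system action. Pairs that are absent count as 0.
ASSOCIATION = {
    ("confirm", "confirm"): 1,
    ("confirm", "deny"): 1,
    ("confirm", "cancel"): 1,
    ("select", "inform"): 1,
    ("ask", "inform"): 1,
}


def is_associated(system_act: str, user_act: str) -> int:
    """Return 1 when the reply is contextually linked to the prompt, else 0."""
    return ASSOCIATION.get((system_act, user_act), 0)


def handle_reply(system_act: str, user_act: str, prompt: str) -> str:
    """Continue the dialogue on 1; on 0 treat the reply as noise and ask again."""
    if is_associated(system_act, user_act):
        return "continue"
    return f"reprompt: {prompt}"


# The system asked for confirmation; the user either confirmed or said something unrelated.
on_topic = handle_reply("confirm", "confirm", "Close the window?")
off_topic = handle_reply("confirm", "inform", "Close the window?")
```

Here an unrelated remark such as "The weather is so nice" is assumed to be classified as an inform action, which is not an associated reply to a confirmation request in this hypothetical table, so the system re-asks the question.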
Referring to FIG. 12, in one embodiment, the user can say the command "Navigate to Starbucks", where the slot parameters include "destination" and "Starbucks", and the dialogue action information includes "user" and "inform". The system can reply "I found several Starbucks for you. Which one do you want to go to?", where the slot parameters include the destination search results (namely Starbucks at Peking University North Gate, Starbucks Zhongguancun Store, and Starbucks Xingxiang Store), and the dialogue action information includes "system" and "select". The user replies to the system "The one north of Peking University". The in-vehicle screen can display the destination search results (namely Starbucks at Peking University North Gate, Starbucks Zhongguancun Store, and Starbucks Xingxiang Store); the slot parameters include "destination" and "Starbucks at Peking University North Gate", and the dialogue action information includes "user" and "inform". Notably, the in-vehicle screen can also display various other operations, such as exit and re-navigate.
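The Starbucks exchange can be written down as structured turns, each carrying a speaker, a dialogue action, and slot parameters. The `Turn` class and the `latent_candidates` helper are hypothetical names introduced for this sketch; the helper only shows how the candidates offered by the latest system "select" turn could be recovered as the set against which the next user reply is interpreted.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Turn:
    speaker: str                       # "user" or "system"
    act: str                           # dialogue action, e.g. "inform" or "select"
    slots: dict[str, list[str]] = field(default_factory=dict)


def latent_candidates(history: list[Turn], slot: str) -> list[str]:
    """Return the candidates offered by the most recent system 'select' turn for a slot."""
    for turn in reversed(history):
        if turn.speaker == "system" and turn.act == "select":
            return turn.slots.get(slot, [])
    return []


history = [
    Turn("user", "inform", {"destination": ["Starbucks"]}),
    Turn("system", "select", {"destination": [
        "Starbucks at Peking University North Gate",
        "Starbucks Zhongguancun Store",
        "Starbucks Xingxiang Store",
    ]}),
]
candidates = latent_candidates(history, "destination")
```

Matching the reply "The one north of Peking University" to one of these candidates would be a separate step (for example similarity matching), which this sketch deliberately leaves out.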
Referring to FIG. 13, in some embodiments, step 04 includes the following steps:
Step 041: search a database using the semantic understanding corresponding to the current-round voice command;
Step 042: when the search results contain a result that matches the semantic understanding corresponding to the current-round voice command, take the semantic understanding corresponding to the current-round voice command as the semantic understanding result;
Step 043: when the search results contain no result that matches the semantic understanding corresponding to the current-round voice command, take the global semantic understanding as the semantic understanding result.
In some embodiments, the voice control device 100 includes a second processing module, a third processing module, and a fourth processing module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 041 can be implemented by the second processing module, step 042 by the third processing module, and step 043 by the fourth processing module. That is, the second processing module is configured to search the database using the semantic understanding corresponding to the current-round voice command. The third processing module is configured to take the semantic understanding corresponding to the current-round voice command as the semantic understanding result when the search results contain a result that matches it. The fourth processing module is configured to take the global semantic understanding as the semantic understanding result when the search results contain no result that matches the semantic understanding corresponding to the current-round voice command.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where step 041, step 042, and step 043 can all be implemented by the processor 300. That is, the processor 300 can be configured to: search the database using the semantic understanding corresponding to the current-round voice command; when the search results contain a result that matches the semantic understanding corresponding to the current-round voice command, take that semantic understanding as the semantic understanding result; and when the search results contain no such result, take the global semantic understanding as the semantic understanding result.
Specifically, the database records historical data of multi-round dialogues, including, for example, the context of historical dialogues, the number of dialogue rounds, and task tree diagrams. In one embodiment, as shown in FIG. 5, the database includes a context memory; while the current-round voice command is uploaded, the graphical user interface information is uploaded to the context memory in real time. In some embodiments, natural language understanding during semantic understanding can draw on the relevant data in the database.
After the database is searched, if there is a result that matches the semantic understanding corresponding to the current-round voice command, that semantic understanding is taken as the semantic understanding result.
After the database is searched, if there is no result that matches the semantic understanding corresponding to the current-round voice command, the global semantic understanding is taken as the semantic understanding result.
In other words, there are two kinds of semantic understanding result: one takes the semantic understanding corresponding to the current-round voice command as the result, and the other takes the global semantic understanding as the result.
Notably, the data in the database is updated during every round of dialogue. The basis for the update may include, but is not limited to, semantic understanding results and historical dialogue state information, and is not limited here.
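Steps 041 to 043 amount to a lookup with a fallback, which can be sketched as below. Modelling the database as a set of known semantic understandings and returning a ("scene" or "global", value) pair are assumptions made for this illustration; the application describes the database only in general terms.

```python
from __future__ import annotations

from typing import Optional


def choose_result(scene_understanding: Optional[str],
                  database: set[str],
                  global_understanding: str) -> tuple[str, str]:
    """Return (source, result) where source is 'scene' or 'global'."""
    # Steps 041/042: search the database; on a match, keep the current-round understanding.
    if scene_understanding is not None and scene_understanding in database:
        return "scene", scene_understanding
    # Step 043: no match, so fall back to the global semantic understanding.
    return "global", global_understanding


database = {"navigate:Starbucks at Peking University North Gate", "window:close"}
hit = choose_result("window:close", database, "chitchat:weather")
miss = choose_result("seat:heat", database, "chitchat:weather")
```

Keeping the source tag alongside the result makes the later branch between updating the dialogue history and starting a new dialogue task straightforward.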
Referring to FIG. 14, in some embodiments, step 05 includes the following steps:
Step 051: when the semantic understanding corresponding to the current-round voice command is taken as the semantic understanding result, update the voice dialogue information of the historical rounds and send an operation command so that the vehicle 1000 performs the corresponding operation;
Step 052: when the global semantic understanding is taken as the semantic understanding result, control the vehicle 1000 to initiate a new round of dialogue task.
In some embodiments, the voice control device 100 includes a fifth processing module and a sixth processing module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 051 can be implemented by the fifth processing module and step 052 by the sixth processing module. That is, the fifth processing module is configured to update the voice dialogue information of the historical rounds and send an operation command so that the vehicle 1000 performs the corresponding operation when the semantic understanding corresponding to the current-round voice command is taken as the semantic understanding result. The sixth processing module is configured to control the vehicle 1000 to initiate a new round of dialogue task when the global semantic understanding is taken as the semantic understanding result.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 051 and step 052 can be implemented by the processor 300. That is, the processor 300 can be configured to: when the semantic understanding corresponding to the current-round voice command is taken as the semantic understanding result, update the voice dialogue information of the historical rounds and send an operation command so that the vehicle 1000 performs the corresponding operation; and when the global semantic understanding is taken as the semantic understanding result, control the vehicle 1000 to initiate a new round of dialogue task.
Specifically, when the semantic understanding corresponding to the current-round voice command is taken as the semantic understanding result, the voice dialogue information of the historical rounds can be updated. The update can be carried out by a dialogue state information update module and a dialogue policy optimization module. In some embodiments, the two modules can be merged into a single dialogue management module. Updating the dialogue information includes updating the dialogue action information and the dialogue state information. After the update and optimization, reply information (an operation command) is generated, and the operation command can be sent so that the vehicle 1000 performs the corresponding operation.
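The branch between step 051 and step 052 can be sketched as a small dispatcher. The function name, the "scene"/"global" source tag, the list used as dialogue history, and the returned strings are all assumptions for illustration; in the application the update is performed by the dialogue state information update module and the dialogue policy optimization module.

```python
from __future__ import annotations


def act_on_result(source: str, understanding: str, history: list[str]) -> str:
    """Apply step 051 for a scene-derived result and step 052 for a global one."""
    if source == "scene":
        # Step 051: update the historical dialogue information, then emit an
        # operation command for the vehicle.
        history.append(understanding)
        return f"operate: {understanding}"
    # Step 052: the history is left untouched and a new dialogue task is started.
    return "new_dialogue_task"


history: list[str] = []
first = act_on_result("scene", "window:close", history)
second = act_on_result("global", "chitchat:weather", history)
```

After both calls the history contains only the scene-derived understanding, which mirrors the rule that a global semantic understanding does not update the historical dialogue information.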
Referring to FIG. 15, in some embodiments, step 051 includes the following steps:
Step 0511: according to the voice dialogue information of the historical rounds, query the dialogue action information output by the user and the dialogue action information output by the system to obtain user slot parameters and system slot parameters;
Step 0512: perform slot actions using the user slot parameters and the system slot parameters, and update the trusted slot parameters so as to update the dialogue state information.
In some embodiments, the voice control device 100 includes a seventh processing module and an eighth processing module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 0511 can be implemented by the seventh processing module and step 0512 by the eighth processing module. That is, the seventh processing module is configured to query, according to the voice dialogue information of the historical rounds, the dialogue action information output by the user and the dialogue action information output by the system to obtain user slot parameters and system slot parameters. The eighth processing module is configured to perform slot actions using the user slot parameters and the system slot parameters, and to update the trusted slot parameters so as to update the dialogue state information.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where both step 0511 and step 0512 can be implemented by the processor 300. That is, the processor 300 can be configured to: according to the voice dialogue information of the historical rounds, query the dialogue action information output by the user and the dialogue action information output by the system to obtain user slot parameters and system slot parameters; and perform slot actions using the user slot parameters and the system slot parameters, and update the trusted slot parameters so as to update the dialogue state information.
Specifically, referring to FIG. 16, FIG. 16 includes a multi-round dialogue. User slot parameters and system slot parameters can be obtained from the user's dialogue action information and the system's dialogue action information, and task parameters can be obtained at the same time. Slot actions are performed with the user slot parameters and the system slot parameters to update the trusted slot parameters, thereby updating the dialogue state information of each task. In some embodiments, the user slot parameters are the slot parameters requested by the user in each round; the system inquiry slot refers to the slot parameters or candidate slot parameters that the system needs to ask about, offer for selection, or confirm; and the trusted slot refers to the slot parameters that are finally output. In some embodiments, performing a slot action includes at least one of a continue action, a delete action, an update action, and an invalidate action.
Specifically, slot actions include, but are not limited to, the continue action, the delete action, the update action, and the invalidate action. The continue action means that the slot parameter is the same as in the previous round and is not updated in the current round. The delete action deletes an existing slot parameter. The update action updates an existing slot parameter. The invalidate action means that some task-related slot parameters are no longer of concern in the subsequent dialogue.
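The four slot actions can be sketched as operations on a small dialogue state. The `DialogueState` layout, in particular representing the invalidate action by moving a slot name into an `inactive` set while keeping its value, is an assumption of this sketch; the application states only that invalidated slot parameters are no longer of concern in later rounds.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class SlotAction(Enum):
    CONTINUE = "continue"      # same value as the previous round, nothing changes
    DELETE = "delete"          # remove an existing slot parameter
    UPDATE = "update"          # overwrite an existing slot parameter
    INVALIDATE = "invalidate"  # keep the value but stop considering it


@dataclass
class DialogueState:
    trusted: dict[str, str] = field(default_factory=dict)
    inactive: set[str] = field(default_factory=set)


def apply_slot_action(state: DialogueState, slot: str,
                      action: SlotAction, value: Optional[str] = None) -> None:
    """Mutate the trusted slot parameters according to one slot action."""
    if action is SlotAction.CONTINUE:
        return
    if action is SlotAction.DELETE:
        state.trusted.pop(slot, None)
        state.inactive.discard(slot)
    elif action is SlotAction.UPDATE:
        if value is None:
            raise ValueError("UPDATE requires a value")
        state.trusted[slot] = value
        state.inactive.discard(slot)
    elif action is SlotAction.INVALIDATE:
        state.inactive.add(slot)


state = DialogueState({"destination": "Starbucks", "route": "fastest"})
apply_slot_action(state, "destination", SlotAction.UPDATE,
                  "Starbucks at Peking University North Gate")
apply_slot_action(state, "route", SlotAction.INVALIDATE)
```

In this example the destination is refined by the user's second reply, while the route preference is kept but marked as no longer relevant.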
Referring to FIG. 17, in some embodiments, step 051 includes the following steps:
Step 0513: determine the priority order of multiple scene pages in the dynamic scene;
Step 0514: according to the priority order of the multiple scene pages, push the high-priority scene page node onto the stack of the low-priority scene page;
Step 0515: control the vehicle 1000 to perform the operation corresponding to the high-priority scene page.
In some embodiments, the voice control device 100 includes a judging module, a ninth processing module, and a tenth processing module. The voice control method of the present application can be implemented by the voice control device 100 of the embodiments of the present application, where step 0513 can be implemented by the judging module, step 0514 by the ninth processing module, and step 0515 by the tenth processing module. That is, the judging module is configured to determine the priority order of multiple scene pages in the dynamic scene. The ninth processing module is configured to push, according to the priority order of the multiple scene pages, the high-priority scene page node onto the stack of the low-priority scene page. The tenth processing module is configured to control the vehicle 1000 to perform the operation corresponding to the high-priority scene page.
In some embodiments, the voice control method of the embodiments of the present application can be implemented by the server 500 of the embodiments of the present application, where step 0513, step 0514, and step 0515 can all be implemented by the processor 300. That is, the processor 300 can be configured to: determine the priority order of multiple scene pages in the dynamic scene; according to the priority order of the multiple scene pages, push the high-priority scene page node onto the stack of the low-priority scene page; and control the vehicle 1000 to perform the operation corresponding to the high-priority scene page.
Specifically, a dynamic scene can include multiple scene pages, which can be sorted by priority, and a high-priority scene page node can be pushed onto the stack of a low-priority scene page. Referring to FIG. 18, FIG. 18 includes three scene pages: a first scene page A1, a second scene page A2, and a third scene page A3. The three scene pages can be regarded as a stack-like structure in which each scene page corresponds to one dialogue task; that is, each stack entry can be regarded as a dialogue task. In some embodiments, the priority includes the page depth and the priority of each element within the task. The page depth is denoted X and the priority of each element within the task is denoted Y; the higher the priority Y, the greater the page depth. In this way, when handling relatively complex multi-round dialogues, the logical relationship between scene pages can be used to update the dynamic scene. In FIG. 18, hitting the "Details" button on the first scene page A1 brings up the third scene page A3; if the "Details" button of the first scene page A1 is hit while the second scene page A2 is the current page, the third scene page A3 pops up and covers the second scene page A2. Thus, among the multiple scene pages, the page that enters first is popped last, and the page that enters last is popped first.
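The last-in, first-out behaviour of scene pages can be sketched with a plain stack. The class `ScenePageStack` and its method names are hypothetical; the sketch models only the ordering of pages from the FIG. 18 example, not page depth X or element priority Y.

```python
from __future__ import annotations

from typing import Optional


class ScenePageStack:
    """Scene pages in LIFO order; the top page is the one currently handled."""

    def __init__(self) -> None:
        self._pages: list[str] = []

    def push(self, page: str) -> None:
        """A newly opened (higher-priority) page covers the pages below it."""
        self._pages.append(page)

    def pop(self) -> str:
        """Close the top page and return it, exposing the page underneath."""
        return self._pages.pop()

    def top(self) -> Optional[str]:
        return self._pages[-1] if self._pages else None


stack = ScenePageStack()
stack.push("A1")
stack.push("A2")
stack.push("A3")   # hitting "Details" while A2 is current: A3 covers A2
current = stack.top()
closed = stack.pop()
```

After the pop, A2 becomes the current page again, matching the rule that the page entered last is the first to be popped.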
Referring to FIG. 19, an embodiment of the present application further provides a non-volatile computer-readable storage medium 800 on which a computer program is stored. When the computer program is executed by one or more processors 300, the processors 300 are caused to perform the steps of the control method of any of the above embodiments.
For example, when the program is executed by the processor 300, the following steps of the voice control method are implemented:
Step 01: receive the voice command of the current round, receive graphical user interface information, and fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene;
Step 02: generate a scene semantic document according to the dynamic scene;
Step 03: determine, according to the scene semantic document, the semantic understanding corresponding to the voice command of the current round;
Step 04: determine a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding;
Step 05: control the vehicle 1000 to perform the corresponding operation according to the semantic understanding result.
In this way, the non-volatile computer-readable storage medium 800 of the embodiments of the present application can fuse the graphical user interface information with the voice dialogue information of historical rounds to generate a dynamic scene, and generate a scene semantic document according to the dynamic scene. The scene semantic document constrains the semantic understanding process within a task and manages the speech across multi-round states, thereby improving the accuracy of semantic understanding of multi-round dialogue in this vertical domain.
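Steps 01 to 05 can be sketched as a small Python pipeline. This is a simplified illustration under stated assumptions, not the method of the present application: every function name is hypothetical, the scene semantic document is reduced to a dictionary from utterances to actions, and an exact-match lookup stands in for both the semantic understanding and the search for a matching result.

```python
from typing import Dict, List, Optional, Tuple


def build_dynamic_scene(gui_elements: List[str], history: List[str]) -> Dict[str, List[str]]:
    """Step 01: fuse GUI information with the dialogue of historical rounds."""
    return {"gui": list(gui_elements), "history": list(history)}


def build_scene_semantic_document(scene: Dict[str, List[str]]) -> Dict[str, str]:
    """Step 02: map each utterance the scene understands to an action.
    The "open <label>" pattern is an assumption made for this sketch."""
    return {f"open {label.lower()}": f"click:{label}" for label in scene["gui"]}


def understand_in_scene(command: str, document: Dict[str, str]) -> Optional[str]:
    """Step 03: semantic understanding restricted to the scene semantic document."""
    return document.get(command.strip().lower())


def understand_globally(command: str) -> str:
    """Stand-in for the global semantic understanding (hypothetical)."""
    return f"new_task:{command.strip().lower()}"


def handle_round(command: str, gui_elements: List[str], history: List[str]) -> Tuple[str, str]:
    """Run steps 01-04 and return (source, action); the action would drive step 05."""
    scene = build_dynamic_scene(gui_elements, history)
    document = build_scene_semantic_document(scene)
    scene_result = understand_in_scene(command, document)
    # Step 04: prefer the in-scene understanding, otherwise fall back to the global one.
    if scene_result is not None:
        return ("scene", scene_result)
    return ("global", understand_globally(command))


print(handle_round("Open Details", ["Details", "Navigation"], []))
print(handle_round("play some music", ["Details"], ["open details"]))
```

In this sketch the first call is resolved inside the scene and the second falls back to the global understanding; in a real system the returned action would be sent to the vehicle as the operation of step 05.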
It can be understood that the computer program includes computer program code. The computer program code may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM), random access memory (RAM), a software distribution medium, and the like.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "exemplary embodiment", "example", "specific example" or "some examples" mean that a specific feature, structure, material or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such references to the above terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment or portion of code that includes one or more executable instructions for implementing the steps of a specific logical function or process. The scope of the preferred embodiments of the present application includes further implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
The logic and/or steps represented in a flowchart or otherwise described herein may, for example, be regarded as an ordered list of executable instructions for implementing logical functions, and may be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus or device (such as a computer-based system, a system including a processing module, or another system that can fetch instructions from an instruction execution system, apparatus or device and execute them). For the purposes of this specification, a "computer-readable medium" may be any means that can contain, store, communicate, propagate or transport a program for use by, or in connection with, an instruction execution system, apparatus or device. More specific examples (a non-exhaustive list) of computer-readable media include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, because the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting or otherwise processing it in a suitable manner if necessary, and then stored in computer memory.
It should be understood that the parts of the embodiments of the present application may be implemented in hardware, software, firmware or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following techniques known in the art: a discrete logic circuit having logic gates for implementing logical functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and the like.
Those of ordinary skill in the art will understand that all or part of the steps of the methods of the above embodiments can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
Although embodiments of the present application have been shown and described above, it can be understood that the above embodiments are exemplary and should not be construed as limiting the present application. Those of ordinary skill in the art may make changes, modifications, substitutions and variations to the above embodiments within the scope of the present application.

Claims (12)

  1. A voice control method, characterized by comprising:
    receiving a voice command of a current round, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene;
    generating a scene semantic document according to the dynamic scene;
    determining, according to the scene semantic document, a semantic understanding corresponding to the voice command of the current round;
    determining a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding; and
    controlling a vehicle to perform a corresponding operation according to the semantic understanding result.
  2. The voice control method according to claim 1, wherein the receiving a voice command of a current round, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene comprises:
    upon receiving the voice command of the current round, determining a semantic space according to the voice dialogue information of historical rounds, the semantic space being used to characterize the semantic understanding orientation of the voice command of the current round; and
    determining the dynamic scene according to the semantic space and the graphical user interface information.
  3. The voice control method according to claim 1, wherein the receiving a voice command of a current round, receiving graphical user interface information, and fusing the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene comprises:
    upon receiving the voice command of the current round, loading and parsing dynamic scene elements included in the voice dialogue information of historical rounds; and
    generating the dynamic scene according to the dynamic scene elements and the voice dialogue information of historical rounds.
  4. The voice control method according to claim 3, wherein the similarity between document data of the scene semantic document and the dynamic scene elements is greater than a similarity threshold.
  5. The voice control method according to claim 1, wherein the determining a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding comprises:
    searching a database using the semantic understanding corresponding to the voice command of the current round;
    when the search results include a result matching the semantic understanding corresponding to the voice command of the current round, taking the semantic understanding corresponding to the voice command of the current round as the semantic understanding result; and
    when the search results include no result matching the semantic understanding corresponding to the voice command of the current round, taking the global semantic understanding as the semantic understanding result.
  6. The voice control method according to claim 5, wherein the controlling a vehicle to perform a corresponding operation according to the semantic understanding result comprises:
    when the semantic understanding corresponding to the voice command of the current round is taken as the semantic understanding result, updating the voice dialogue information of historical rounds and sending an operation instruction to cause the vehicle to perform the corresponding operation; and
    when the global semantic understanding is taken as the semantic understanding result, controlling the vehicle to initiate a new round of dialogue task.
  7. The voice control method according to claim 6, wherein the updating the voice dialogue information of historical rounds comprises:
    querying, according to the voice dialogue information of historical rounds, dialogue action information output by the user and dialogue action information output by the system, to obtain user slot parameters and system slot parameters; and
    performing a slot action using the user slot parameters and the system slot parameters, and updating trusted slot parameters, so as to update dialogue state information.
  8. The voice control method according to claim 7, wherein the performing a slot action comprises at least one of a continuation action, a deletion action, an update action and an invalidation action.
  9. The voice control method according to claim 6, wherein the updating the voice dialogue information of historical rounds comprises:
    determining a priority order of multiple scene pages in the dynamic scene;
    pushing the high-priority scene page node onto the low-priority scene page stack according to the priority order of the multiple scene pages; and
    controlling the vehicle to perform the operation corresponding to the high-priority scene page.
  10. A voice control device, characterized by comprising:
    a first generation module configured to receive a voice command of a current round, receive graphical user interface information, and fuse the graphical user interface information with voice dialogue information of historical rounds to generate a dynamic scene;
    a second generation module configured to generate a scene semantic document according to the dynamic scene;
    a first determination module configured to determine, according to the scene semantic document, a semantic understanding corresponding to the voice command of the current round;
    a second determination module configured to determine a semantic understanding result according to the semantic understanding corresponding to the voice command of the current round or a global semantic understanding; and
    a control module configured to control a vehicle to perform a corresponding operation according to the semantic understanding result.
  11. A server, characterized by comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the voice control method according to any one of claims 1-9.
  12. A non-volatile computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by one or more processors, the voice control method according to any one of claims 1-9 is implemented.
PCT/CN2022/092246 2021-06-03 2022-05-11 Voice control method, voice control device, server, and storage medium WO2022252946A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110619459.XA CN113421561B (en) 2021-06-03 2021-06-03 Voice control method, voice control device, server, and storage medium
CN202110619459.X 2021-06-03

Publications (1)

Publication Number Publication Date
WO2022252946A1 (en) 2022-12-08

Family

ID=77713765

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/092246 WO2022252946A1 (en) 2021-06-03 2022-05-11 Voice control method, voice control device, server, and storage medium

Country Status (2)

Country Link
CN (1) CN113421561B (en)
WO (1) WO2022252946A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117021083A (en) * 2023-08-09 2023-11-10 北京小米机器人技术有限公司 Robot, control method and device thereof, and storage medium

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113421561B (en) * 2021-06-03 2024-01-09 广州小鹏汽车科技有限公司 Voice control method, voice control device, server, and storage medium
CN115346530B (en) * 2022-10-19 2023-01-13 亿咖通(北京)科技有限公司 Voice control method, device, equipment, medium, system and vehicle

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060123358A1 (en) * 2004-12-03 2006-06-08 Lee Hang S Method and system for generating input grammars for multi-modal dialog systems
CN111767021A (en) * 2020-06-28 2020-10-13 广州小鹏车联网科技有限公司 Voice interaction method, vehicle, server, system and storage medium
CN111768777A (en) * 2020-06-28 2020-10-13 广州小鹏车联网科技有限公司 Voice control method, information processing method, vehicle and server
CN111768780A (en) * 2020-06-28 2020-10-13 广州小鹏车联网科技有限公司 Voice control method, information processing method, vehicle and server
CN112164400A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN112164401A (en) * 2020-09-18 2021-01-01 广州小鹏汽车科技有限公司 Voice interaction method, server and computer-readable storage medium
CN112365892A (en) * 2020-11-10 2021-02-12 杭州大搜车汽车服务有限公司 Man-machine interaction method, device, electronic device and storage medium
CN113421561A (en) * 2021-06-03 2021-09-21 广州小鹏汽车科技有限公司 Voice control method, voice control device, server and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170293610A1 (en) * 2013-03-15 2017-10-12 Bao Tran Voice assistant
US9582246B2 (en) * 2014-03-04 2017-02-28 Microsoft Technology Licensing, Llc Voice-command suggestions based on computer context
CN107146622B (en) * 2017-06-16 2021-02-19 合肥美的智能科技有限公司 Refrigerator, voice interaction system, method, computer device and readable storage medium
CN109545203A (en) * 2018-12-14 2019-03-29 深圳壹账通智能科技有限公司 Audio recognition method, device, equipment and storage medium
CN111753061B (en) * 2019-03-27 2024-03-12 北京猎户星空科技有限公司 Multi-round dialogue processing method and device, electronic equipment and storage medium
CN112309384B (en) * 2019-08-28 2023-01-06 抖音视界有限公司 Voice recognition method, device, electronic equipment and medium
CN112017663B (en) * 2020-08-14 2024-04-30 博泰车联网(南京)有限公司 Voice generalization method and device and computer storage medium
CN112053688B (en) * 2020-08-27 2024-03-08 海信视像科技股份有限公司 Voice interaction method, interaction equipment and server
CN112182196A (en) * 2020-11-03 2021-01-05 海信视像科技股份有限公司 Service equipment applied to multi-turn conversation and multi-turn conversation method
CN112885354B (en) * 2021-01-25 2022-09-23 海信视像科技股份有限公司 Display device, server and display control method based on voice

Also Published As

Publication number Publication date
CN113421561A (en) 2021-09-21
CN113421561B (en) 2024-01-09

Similar Documents

Publication Publication Date Title
WO2022252946A1 (en) Voice control method, voice control device, server, and storage medium
US10733983B2 (en) Parameter collection and automatic dialog generation in dialog systems
JP6305588B2 (en) Extended conversation understanding architecture
EP3491533B1 (en) Providing command bundle suggestions for an automated assistant
JP6960006B2 (en) How and system to handle unintentional queries in conversational systems
EP3389044A1 (en) Management layer for multiple intelligent personal assistant services
US20190147044A1 (en) Underspecification of intents in a natural language processing system
US11823661B2 (en) Expediting interaction with a digital assistant by predicting user responses
US20200259891A1 (en) Facilitating Interaction with Plural BOTs Using a Master BOT Framework
WO2018213740A1 (en) Action recipes for a crowdsourced digital assistant system
US20110153322A1 (en) Dialog management system and method for processing information-seeking dialogue
JP7300435B2 (en) Methods, apparatus, electronics, and computer-readable storage media for voice interaction
US20180075131A1 (en) Computerized natural language query intent dispatching
JP2020079921A (en) Voice interaction realizing method, device, computer device and program
US11069351B1 (en) Vehicle voice user interface
CN109408799B (en) Semantic decision method and system
US20220283831A1 (en) Action recipes for a crowdsourced digital assistant system
CN111813900B (en) Multi-round dialogue processing method and device, electronic equipment and storage medium
JP7347217B2 (en) Information processing device, information processing system, information processing method, and program
US20180366123A1 (en) Representing Results From Various Speech Services as a Unified Conceptual Knowledge Base
CN115129878A (en) Conversation service execution method, device, storage medium and electronic equipment
KR20220122385A (en) Apparatus and method for providing ethics-based service
US11677832B2 (en) Voice activated device enabling
KR20200119035A (en) Dialogue system, electronic apparatus and method for controlling the dialogue system
CN110704592B (en) Statement analysis processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application. Ref document number: 22814993; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 Ep: pct application non-entry in european phase. Ref document number: 22814993; Country of ref document: EP; Kind code of ref document: A1