WO2024002298A1 - A voice command processing method, device, system and storage medium - Google Patents

A voice command processing method, device, system and storage medium

Info

Publication number
WO2024002298A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice command
voice
historical
slot
command
Application number
PCT/CN2023/104190
Other languages
English (en)
French (fr)
Inventor
张亚兵
韩骁枫
张田
陈开济
许坤
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Publication of WO2024002298A1


Classifications

    • G: Physics
    • G10: Musical instruments; Acoustics
    • G10L: Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding
    • G10L 15/00: Speech recognition
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26: Speech to text systems
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48: Speech or voice analysis techniques specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L 2015/223: Execution procedure of a spoken command

Definitions

  • the present application relates to the field of voice control technology, and in particular, to a voice command processing method, device, system and storage medium.
  • voice control is usually implemented through voice assistants. Users can input voice commands through the voice assistant, and then the voice assistant controls the electronic device to perform operations corresponding to the voice commands based on the voice commands input by the user. Especially in the field of smart homes, voice assistants can be used as the control port of smart homes to automatically control smart devices directly through voice conversations, making it convenient for users to use various devices.
  • Embodiments of the present application provide a voice command processing method, device, system and storage medium to determine the complete semantics of the voice command when the semantics of the voice command are missing, thereby enabling the voice command to be executed.
  • the first aspect provides a voice command processing method, which can be executed by a voice assistant.
  • the method includes: acquiring a first voice command, determining the intention of the first voice command, and determining the missing slot of the first voice command according to the intention of the first voice command; acquiring a second voice command in the historical voice command set, where the second voice command is related to the first voice command; and determining the slot of the first voice command according to the slot of the second voice command.
  • In the above method, when the first voice command is missing a slot, a second voice command related to the first voice command can be obtained from the historical voice command set, and the slot of the second voice command is used to fill the slot of the first voice command, thereby obtaining a semantically complete voice command and enabling the voice command to be executed.
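As a sketch of this flow (not code from the patent; the names VoiceCommand, required_slots and find_related are illustrative assumptions), the slot-filling logic could look like this:

```python
# Sketch of the slot-filling method: if the first command is missing a
# required slot, borrow it from a related historical command, then store
# the completed command back into the history.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class VoiceCommand:                            # hypothetical data model
    text: str
    intent: Optional[str] = None
    slots: dict = field(default_factory=dict)  # e.g. {"device": "air conditioner"}

def required_slots(intent: str) -> set:
    # Assumed lookup table: each intent declares the slots it needs.
    table = {"TURN_UP_TEMPRATURE": {"device", "location"}}
    return table.get(intent, set())

def find_related(first: VoiceCommand, history: list) -> Optional[VoiceCommand]:
    # Placeholder: most recent historical command with the same intent.
    # The patent selects the second command by a correlation degree instead.
    for cmd in reversed(history):
        if cmd.intent == first.intent and cmd.slots:
            return cmd
    return None

def fill_missing_slots(first: VoiceCommand, history: list) -> VoiceCommand:
    missing = required_slots(first.intent) - first.slots.keys()
    if missing:
        second = find_related(first, history)
        if second is not None:
            for name in missing & second.slots.keys():
                first.slots[name] = second.slots[name]  # borrow the slot value
    history.append(first)                      # store the completed first command
    return first
```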
  • the determining of the slot of the first voice command according to the slot of the second voice command includes: the missing slot of the first voice command is provided by the corresponding slot of the second voice command.
  • the method further includes: adding the slot of the second voice command to the first voice command, and storing the first voice command with the added slot into the historical voice command set.
  • In this way, the complete first voice command is stored in the historical voice command set, which can provide a basis for subsequent voice processing operations.
  • the adding of the slot of the second voice command to the first voice command includes: obtaining the slot in the structured data of the second voice command, where the slot in the structured data is slot information expressed in natural language or as protocol parameters, the protocol parameters being obtained by mapping the slot information expressed in natural language; and adding the slot in the structured data of the second voice command to the structured data of the first voice command.
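A minimal sketch of this structured data and the slot-copying step, assuming illustrative field names (the patent does not prescribe a concrete data model):

```python
# Sketch of structured data where a slot holds either natural-language
# slot information or protocol parameters mapped from it.
from dataclasses import dataclass, field
from typing import Union

@dataclass
class ProtocolParams:           # illustrative IoT-protocol form of a slot
    device_id: str              # deviceID
    device_type: str            # deviceType

SlotValue = Union[str, ProtocolParams]  # natural language or protocol form

@dataclass
class StructuredCommand:
    intent: str
    slots: dict = field(default_factory=dict)  # slot name -> SlotValue

def add_slots(first: StructuredCommand, second: StructuredCommand) -> None:
    # Copy the second (historical) command's slots into the first command's
    # structured data without overwriting slots it already has.
    for name, value in second.slots.items():
        first.slots.setdefault(name, value)
```

Under these assumptions, add_slots would, for example, carry device: "air conditioner" from a historical command into a new command that only names an intention.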
  • the method further includes: if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, deleting the demonstrative pronoun from the first voice command.
  • deleting the demonstrative pronoun used to indicate the slot can make the semantics of the voice command clearer.
  • the obtaining of the second voice command in the historical voice command set includes: obtaining the second voice command related to the first voice command from the historical voice command set according to the correlation between the first voice command and the historical voice commands in the historical voice command set.
  • the obtaining of the second voice command related to the first voice command in the historical voice command set includes: determining the correlation between the first voice command and each historical voice command in the historical voice command set according to the first voice command, the intention of the first voice command and/or the associated information corresponding to the first voice command, as well as each historical voice command in the set, its intention and/or its corresponding associated information; wherein the associated information corresponding to the first voice command is collected when the first voice command is received.
  • performing the matching operation according to the associated information corresponding to the voice command can make the matching result (i.e., the matched second voice command) more accurate.
  • the obtaining of the second voice command in the historical voice command set includes: the first electronic device sends a first request message to the cloud or a third electronic device, where the first request message is used to request the voice command associated with the first voice command in the historical voice command set, and the first electronic device is the receiving device of the first voice command; the first electronic device receives a first response message sent by the cloud or the third electronic device, where the first response message carries the second voice command, and the second voice command is obtained from the historical voice command set according to the correlation between the first voice command and the historical voice commands in the set.
  • the first request message carries the first voice command, the intention of the first voice command, and/or the associated information corresponding to the first voice command.
  • the associated information corresponding to the first voice command includes at least one of the following:
  • device information: the information of the receiving device of the first voice command;
  • location information: the location information of the receiving device of the first voice command;
  • time information: the reception time of the first voice command, and/or the time interval between the first voice command and the previously received voice command;
  • user identity information: associated with the characteristic information of the audio data of the first voice command.
  • the associated information corresponding to the voice command may include information of multiple different dimensions, thereby improving the accuracy of matching.
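A hedged sketch of such a correlation computation; the weights and the exponential time decay are assumptions of this example, since the text only states that correlation is determined from these dimensions:

```python
import math

# Assumed weights per association dimension; the patent lists the
# dimensions but not how they are combined.
def correlation(first: dict, hist: dict) -> float:
    score = 0.0
    if first.get("device") == hist.get("device"):
        score += 0.3                       # same receiving device
    if first.get("account") == hist.get("account"):
        score += 0.2                       # same logged-in user account
    if first.get("location") == hist.get("location"):
        score += 0.2                       # same receiving-device location
    if first.get("speaker") == hist.get("speaker"):
        score += 0.2                       # same user identity (voice features)
    # Time dimension: a short interval between the two commands makes them
    # more likely related; the 60 s decay constant is an assumption.
    dt = abs(first.get("time", 0) - hist.get("time", 0))
    score += 0.1 * math.exp(-dt / 60.0)
    return score

def pick_second_command(first: dict, history: list, threshold: float = 0.5):
    # Select the most correlated historical command, if any clears a threshold.
    best = max(history, key=lambda h: correlation(first, h), default=None)
    if best is not None and correlation(first, best) >= threshold:
        return best
    return None
```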
  • the obtaining of the second voice command in the historical voice command set, and the determining of the slot of the first voice command according to the slot of the second voice command, include: the cloud obtains the second voice command related to the first voice command from the historical voice command set according to the correlation between the first voice command and the historical voice commands in the set, and determines the slot of the first voice command according to the slot of the second voice command, where the missing slot of the first voice command is provided by the corresponding slot of the second voice command.
  • the obtaining of the first voice command includes: the cloud converts the audio data of the first voice command from the first electronic device to obtain corresponding text data; the determining of the intention of the first voice command and of its missing slot includes: the cloud parses the text data to obtain the intention of the first voice command, and determines the missing slot of the first voice command according to the intention; the obtaining of the second voice command in the historical voice command set and the determining of the slot of the first voice command include: the cloud obtains the second voice command in the historical voice command set, and determines the slot of the first voice command according to the slot of the second voice command.
  • the historical voice command set includes structured data of the historical voice commands, and the structured data of a historical voice command includes the intention and slot.
  • a voice command processing system including:
  • An automatic speech recognition module used to convert the audio data of the first voice instruction into text data
  • a natural language understanding module used to analyze the text data of the first voice instruction and obtain the intention of the first voice instruction
  • a processing module, configured to: if it is determined according to the intention of the first voice command that the first voice command is missing a slot, obtain a second voice command in the historical voice command set, and determine the slot of the first voice command according to the slot of the second voice command; wherein the second voice command is related to the first voice command.
  • the missing slot of the first voice command is provided by the corresponding slot of the second voice command.
  • the processing module is further configured to: after determining the slot of the first voice command according to the slot of the second voice command, add the slot of the second voice command to the first voice command, and store the first voice command with the added slot into the historical voice command set.
  • the processing module is specifically configured to: obtain the slot in the structured data of the second voice command, where the slot in the structured data of the second voice command is slot information expressed in natural language or as protocol parameters, the protocol parameters being obtained by mapping the slot information expressed in natural language; and add the slot in the structured data of the second voice command to the structured data of the first voice command.
  • the processing module is further configured to: if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, delete the demonstrative pronoun from the first voice command.
  • the processing module is specifically configured to: obtain, from the historical voice command set, the second voice command related to the first voice command according to the correlation between the first voice command and the historical voice commands in the set.
  • the processing module is specifically configured to: determine the correlation between the first voice command and each historical voice command in the historical voice command set based on the first voice command, the intention of the first voice command and/or the associated information corresponding to the first voice command, as well as each historical voice command in the set, its intention and/or its corresponding associated information, where the associated information corresponding to the first voice command is collected when the first voice command is received and the associated information corresponding to a historical voice command is collected when that historical voice command is received; and select the second voice command related to the first voice command from the historical voice command set according to the correlation between the first voice command and each historical voice command.
  • the associated information corresponding to the first voice command includes at least one of the following:
  • device information: the information of the receiving device of the first voice command;
  • user account information: the user account information for logging into the voice assistant;
  • location information: the location information of the receiving device of the first voice command;
  • time information: the reception time of the first voice command, and/or the time interval between the first voice command and the previously received voice command;
  • user identity information: associated with the characteristic information of the audio data of the first voice command.
  • the historical voice command set includes structured data of the historical voice commands, and the structured data of a historical voice command includes the intention and slot.
  • the slot is the device, application or service that is to execute the intention of the voice command.
  • the automatic speech recognition module, the natural language understanding module and the processing module are all located in the first electronic device; or, the automatic speech recognition module and the natural language understanding module are located in the first electronic device while the processing module is located in the cloud or a third electronic device; or, the automatic speech recognition module is located in the first electronic device while the natural language understanding module and the processing module are located in the cloud; or, the automatic speech recognition module, the natural language understanding module and the processing module are all located in the cloud.
  • the first electronic device can send a request message to the cloud to request the cloud processing module to perform corresponding processing operations; after the cloud processing module completes the corresponding processing operations, it can return a response message to the first electronic device.
  • the request message may carry the structured data (including the intention) of the first voice command and/or the associated information corresponding to the first voice command, and the response message may carry the second voice command or the slot of the second voice command.
  • the first electronic device can send a request message to the cloud to request the cloud processing module to perform corresponding processing operations; after the cloud processing module completes the corresponding processing operations, it can return a response message to the first electronic device.
  • the request message may carry the text data of the first voice command and/or the associated information corresponding to the first voice command, and the response message may carry the second voice command or the slot of the second voice command.
  • the above system further includes: an execution module, configured to execute the first voice command according to the intention and slot of the first voice command, or to instruct the execution device of the first voice command to execute it, where the execution device is provided by the slot of the first voice command.
  • the above system also includes: a natural language generation module and a text-to-speech module;
  • the execution module is also used to obtain the execution result of the first voice instruction
  • the natural language generation module is used to convert the execution result of the first voice instruction into text data, where the text data is natural language in text format;
  • the text-to-speech module is used to convert the text data into audio data.
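The modules above form a pipeline from audio in to audio out. A toy sketch of that data flow, with stub implementations standing in for the real ASR, NLU, processing, execution, NLG and TTS modules (the stubs are assumptions for illustration only):

```python
# Stub pipeline: each function stands in for the corresponding module.
def asr(audio: bytes) -> str:
    return "play music"                            # stub: audio -> text

def nlu(text: str) -> dict:
    return {"intent": "PLAY_MUSIC", "slots": {}}   # stub: text -> intent/slots

def processing(cmd: dict, history: list) -> dict:
    if not cmd["slots"] and history:               # slot missing: borrow from history
        cmd["slots"].update(history[-1]["slots"])
    history.append(cmd)                            # store the completed command
    return cmd

def execute(cmd: dict) -> str:
    return f"executed {cmd['intent']} via {cmd['slots']}"  # stub execution result

def nlg(result: str) -> str:
    return f"OK, {result}."                        # result -> natural-language text

def tts(text: str) -> bytes:
    return text.encode()                           # stub: text -> audio data

def handle(audio: bytes, history: list) -> bytes:
    return tts(nlg(execute(processing(nlu(asr(audio)), history))))
```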
  • an electronic device, including: one or more processors; and one or more memories storing one or more computer programs, the one or more computer programs including instructions which, when executed by the one or more processors, cause the electronic device to execute the method described in any one of the above first aspects.
  • a fourth aspect provides a computer-readable storage medium, including a computer program which, when run on an electronic device, causes the electronic device to perform the method described in any one of the above first aspects.
  • a fifth aspect provides a computer program product that, when run on an electronic device, causes the electronic device to execute the method described in any one of the above first aspects.
  • a chip system, including: a memory for storing a computer program; and a processor; when the processor calls and runs the computer program from the memory, an electronic device installed with the chip system executes the method described in any one of the above first aspects.
  • Figure 1 is a schematic diagram of the system architecture of scenario one in the embodiment of the present application.
  • Figure 2 is a schematic diagram of the system architecture of scenario two in the embodiment of the present application.
  • Figure 3 is a schematic diagram of the system architecture of scenario three in the embodiment of the present application.
  • Figure 4 is a schematic diagram of the system architecture of scenario four in the embodiment of the present application.
  • Figure 5 is a schematic diagram of the system architecture of scenario five in the embodiment of the present application.
  • Figure 6 is a schematic diagram of the internal hardware structure of the electronic device provided by the embodiment of the present application.
  • Figure 7 is a schematic diagram of the software structure of the electronic device in the embodiment of the present application.
  • Figure 8a, Figure 8b, and Figure 8c are respectively a schematic diagram of the deployment of the functional modules of the voice assistant in the embodiment of the present application;
  • FIG. 9 is a schematic diagram of the execution logic of the voice command processing method provided by the embodiment of the present application.
  • Figure 10 is a schematic flowchart of a voice command processing method provided by an embodiment of the present application.
  • Figure 11 is a schematic diagram of clarifying the execution device through multiple rounds of dialogue in an embodiment of the present application.
  • Figure 12 is a schematic flowchart of a voice command processing method provided by an embodiment of the present application.
  • Figure 13 is a schematic flow chart of another voice command processing method provided by an embodiment of the present application.
  • Figure 14a and Figure 14b are schematic diagrams of the voice command processing scenario in Example 1 of the embodiment of the present application.
  • Figures 15a and 15b are schematic diagrams of the voice command processing scenario in Example 2 of the embodiment of the present application.
  • "one or more" refers to one, two or more than two; "and/or" describes the association relationship of associated objects, indicating that three relationships may exist; for example, A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B can be singular or plural.
  • the character "/" generally indicates that the related objects are in an "or" relationship.
  • the current voice assistant lacks the ability to process voice commands over multiple rounds, requiring the user to provide complete information in each voice command input, including the intention and slot (such as the device used to execute the intention, or the application or service used to execute the intention); if the semantics of the voice command input by the user are incomplete, for example a slot is missing, the voice assistant cannot understand the voice command and the voice command cannot be executed.
  • embodiments of the present application provide a voice command processing method, device, system and storage medium, so that when the semantics of a voice command are incomplete, the voice assistant can determine the missing part (for example, execution device information), thereby making the voice command semantically complete and enabling the voice command to be executed.
  • the "voice assistant" in the embodiments of this application refers to a voice application that helps users solve various problems through intelligent dialogue and instant question-and-answer interaction; for example, it can help users solve life-related problems.
  • Voice assistants usually have automatic speech recognition (ASR) and natural language understanding (NLU) functions.
  • all functions of the voice assistant can be implemented in one electronic device.
  • the function of the voice assistant can be implemented in the cloud, and the receiving function of audio data of the voice command can be implemented in the electronic device.
  • some functions of the voice assistant are implemented in the electronic device, and other functions are implemented in the cloud.
  • some functions of the voice assistant are implemented in the first electronic device, and other functions are implemented in the third electronic device.
  • the function of receiving audio data of the voice command is implemented on the first electronic device.
  • other functions of the voice assistant are implemented on the third electronic device.
  • a voice assistant is also called a voice application or a voice application program, and the embodiment of the present application does not limit the naming method of the voice assistant.
  • Intent refers to identifying what the user's actual or potential needs are.
  • intent recognition is a classifier that classifies user needs into a certain type; or, intent recognition is a sorter that sorts the set of potential user needs according to likelihood.
  • the slot is the parameter carried by the intention.
  • An intent may correspond to several slots.
  • the slot may include the execution device information of the voice command (such as the type of the execution device, for example "air conditioner"), the location of the execution device (such as the room where the execution device is located), the application or service that executes the user's intention, and so on. For example, when controlling the temperature of an air conditioner, the voice command needs to clearly indicate that the execution device is the air conditioner and, if there are multiple air conditioners nearby, the location of the air conditioner.
  • the above parameters are the slots for the intention of "controlling the temperature of the air conditioner".
  • the data format of the execution device information can include the following two types:
  • The first type uses a natural language description in the slot. For example, the execution device information takes the form room: "bedroom", device: "light".
  • The second type uses the underlying protocol parameter description to directly record the underlying device details. For example, device information described in natural language such as "bedroom" and "light" is converted into a protocol description of the device, such as mapping "bedroom" "light" to parameters such as a device identifier (deviceID) and a device type (deviceType) that comply with the Internet of Things (IoT) protocol. The mapped parameters are carried in the control instruction and sent to the execution device so that the execution device can execute the control instruction.
  • For example, the voice command input by the user is: "Turn up the temperature of the air conditioner in the living room";
  • the user intention obtained by parsing is: raise the temperature (TURN_UP_TEMPRATURE);
  • the slots obtained by parsing include: slot one (location): living room; slot two (execution device): air conditioner.
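Written out as the kind of structured data produced by parsing, this example might look as follows (the field names are illustrative, not prescribed by the patent):

```python
# Structured data for the worked example above.
parsed = {
    "text": "Turn up the temperature of the air conditioner in the living room",
    "intent": "TURN_UP_TEMPRATURE",      # intent identifier from the example
    "slots": {
        "location": "living room",       # slot one
        "device": "air conditioner",     # slot two (execution device)
    },
}
```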
  • voice commands appear in different data forms at different processing stages.
  • the electronic device receives the audio data of the voice command; the audio data of the voice command can be converted into corresponding text data; the text data of the voice command can be parsed into structured data of the voice command, including, for example, the intention and slot.
  • the voice command slot can be expressed as a natural language description (such as a device name, for example "air conditioner"), or can be further mapped to underlying protocol parameters (such as device type, device identifier, etc.).
  • the structured data of voice commands can be further converted into control instructions that can be recognized by the execution device, so that the execution device can perform corresponding operations.
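A sketch of this final conversion step, mapping natural-language slot values to protocol parameters and packing them into a control instruction; the device registry and the instruction format are assumptions of the example:

```python
# Assumed registry: (room, device name) -> IoT protocol parameters.
DEVICE_REGISTRY = {
    ("living room", "air conditioner"): {"deviceID": "ac-01", "deviceType": "AirConditioner"},
    ("bedroom", "light"):               {"deviceID": "lamp-7", "deviceType": "Light"},
}

def to_control_instruction(parsed: dict) -> dict:
    # Map the natural-language slot values onto protocol parameters and
    # carry them, together with the intent, in the control instruction.
    key = (parsed["slots"]["location"], parsed["slots"]["device"])
    params = DEVICE_REGISTRY[key]
    return {
        "deviceID": params["deviceID"],
        "deviceType": params["deviceType"],
        "action": parsed["intent"],      # e.g. TURN_UP_TEMPRATURE
    }

instruction = to_control_instruction({
    "intent": "TURN_UP_TEMPRATURE",
    "slots": {"location": "living room", "device": "air conditioner"},
})
# {'deviceID': 'ac-01', 'deviceType': 'AirConditioner', 'action': 'TURN_UP_TEMPRATURE'}
```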
  • the electronic device may include a mobile phone, a personal computer (PC), a tablet computer, a desktop computer, a handheld computer, a notebook (laptop) computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a router, a TV, and other devices.
  • electronic devices may include audio devices, cameras, air conditioners, refrigerators, smart curtains, desk lamps, chandeliers, rice cookers, security equipment (such as smart electronic locks), robots, sweepers, smart scales and other devices that can be connected to the home wireless LAN.
  • electronic devices may include smart headphones, smart glasses, smart watches, smart bracelets, augmented reality (AR)/virtual reality (VR) devices, wireless locators, trackers, and electronic collars.
  • the electronic devices in the embodiments of the present application can also be equipment such as car audio and car air conditioners.
  • the embodiments of the present application do not place any special restrictions on the specific form of the electronic device.
  • a cloud platform is a software platform that uses application virtualization technology and integrates multiple functions such as software search, download, use, management and backup. Through this platform, all kinds of commonly used software can be encapsulated in an independent virtualization environment, so that the application software is not coupled with the system.
  • the first electronic device 10 has an audio collection device or module (such as a microphone) that can receive the user's voice.
  • the first electronic device 10 is also provided with a voice assistant, and the first electronic device 10 also stores a set of the voice commands that the first electronic device 10 has received and processed (hereinafter referred to as a historical voice command set).
  • the historical voice command set includes at least one voice command.
  • the first electronic device 10 may also establish a communication connection with the second electronic device 20 .
  • the first electronic device 10 and the second electronic device 20 may be directly connected point-to-point, such as through Bluetooth, or may be connected through a local area network, or may be connected through the Internet, etc., which is not limited here.
  • the second electronic device 20 may not be equipped with a voice assistant and may lack the ability to process voice commands; it therefore needs to connect to another electronic device (such as the first electronic device 10) so that the other device can process voice commands, and the second electronic device 20 receives and responds to the control commands that the other device sends based on those voice commands.
  • the second electronic device 20 may be an air conditioner, lighting, curtain, etc.
  • the second electronic device 20 may also have voice processing capabilities.
  • When the voice assistant receives a voice command in the form of audio data (also called a voice signal), it recognizes the received audio data based on the ASR function to obtain the voice command in text form, and parses the text based on the NLU function. If the voice command does not include a slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set stored in the first electronic device 10 to obtain a historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of the historical voice command, thereby completing the current voice command. Afterwards, the voice assistant can execute the voice command.
  • the way the voice assistant executes the voice command may include the following situations:
  • If the voice assistant determines that the intention of the voice command is to perform a certain operation and the performer in the slot is an application or service, the voice assistant implements the user's intention by calling the corresponding application or service on the first electronic device 10 or in the cloud.
  • For example, if the voice assistant determines based on the historical voice command set that the slot is the "Alipay" application, the voice command is completed and the complete voice command is "pay with Alipay"; the voice assistant then calls the "Alipay" application installed on the electronic device to complete the payment operation.
  • For another example, the voice command input by the user is "back up files". The voice assistant determines that the slot is the "Huawei Cloud" service and completes the voice command, so that the completed voice command is "back up files to Huawei Cloud". The voice assistant then calls the corresponding service interface and, by interacting with the cloud, has the service located in the cloud complete the corresponding operation.
  • If the voice assistant determines that the intention of the voice command is to perform a certain operation and the slot is the first electronic device 10 where the voice assistant is located, the voice assistant causes the first electronic device 10 to execute the voice command by calling the corresponding functional interface in the first electronic device 10.
  • For example, the voice assistant calls the relevant functional interface of the smart TV, so that the smart TV plays the movie in response to the voice command.
  • If the voice assistant determines that the intention of the voice command is to perform a certain operation and the execution device in the slot is the second electronic device 20, the voice assistant generates a control command that the second electronic device 20 can recognize based on the voice command, and sends the control command to the second electronic device 20 so that the second electronic device 20 executes it.
  • For example, the voice command received by the first electronic device 10 is "Play movie XXX" and the voice assistant determines based on the historical voice command set that the slot is "TV"; the voice command is then completed as "Play movie XXX on TV", converted into a control command that the second electronic device 20 (the TV) can recognize, and sent to the second electronic device 20, so that the TV plays the movie in response to the control command.
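The three execution paths described above can be summarized as a dispatch, sketched here with stub calls and assumed field names ("executor" and the device dictionary are illustrative, not from the patent):

```python
# Dispatch over the three execution paths.
def execute_command(cmd: dict, local_device_name: str, second_device: dict = None):
    executor = cmd["slots"].get("executor", local_device_name)
    if executor in ("Alipay", "Huawei Cloud"):
        # Case 1: the executor is an application or service: call the
        # corresponding application or service interface.
        print(f"calling {executor} for intent {cmd['intent']}")          # stub call
    elif second_device is not None and executor == second_device["name"]:
        # Case 3: the executor is the second electronic device: convert the
        # command into a control instruction it recognizes and send it.
        instruction = {"deviceID": second_device["deviceID"], "action": cmd["intent"]}
        print(f"-> {second_device['name']}: {instruction}")              # stub transport
    else:
        # Case 2: the executor is the first electronic device itself:
        # call the corresponding local functional interface.
        print(f"local execution of {cmd['intent']} with {cmd['slots']}")

execute_command({"intent": "PLAY_VIDEO", "slots": {"executor": "TV"}},
                local_device_name="speaker",
                second_device={"name": "TV", "deviceID": "tv-01"})
```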
  • Figure 2 is a schematic diagram of the system architecture of scenario 2 in the embodiment of the present application.
  • the first electronic device 10 has an audio collection device or module (such as a microphone) that can receive the user's voice, and a voice assistant is also provided in the first electronic device 10 .
  • a collection of historical voice commands is set in the cloud, and the first electronic device 10 can store the received and processed voice commands in the cloud.
  • other electronic devices such as the third electronic device 30 in the figure, can also store received and processed voice commands into a historical voice command collection in the cloud.
  • the first electronic device 10 may also establish a communication connection with the second electronic device 20 .
  • the first electronic device 10 and the second electronic device 20 may be directly connected point-to-point, such as through Bluetooth, or may be connected through a local area network, or may be connected through the Internet, etc., which is not limited here.
  • the first electronic device 10 and the third electronic device 30 belong to the same group.
  • Electronic devices in the same group share a historical voice command set in the cloud.
  • Electronic devices in different groups cannot share the same historical voice command set.
  • the grouping can be set by the user.
  • For example, electronic devices in the same family residence can be divided into the same group, or electronic devices in multiple residences of the same family can be divided into the same group; the embodiments of the present application do not limit this.
  • When the voice assistant receives a voice command in the form of audio data, it recognizes the received audio data based on the ASR function to obtain the voice command in text form, and parses the text based on the NLU function. If the voice command does not include a slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set in the cloud to obtain a historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of the historical voice command, thereby completing the current voice command. Afterwards, the voice assistant can execute the voice command. The way the voice assistant executes voice commands can be found in the relevant description of scenario one.
  • Figure 3 is a schematic diagram of the system architecture of scenario three in the embodiment of the present application.
  • the first electronic device 10 and the third electronic device 30 each have an audio collection device or module (such as a microphone) that can receive the user's voice.
  • the first electronic device 10 and the third electronic device 30 are also each provided with a voice assistant.
  • a communication connection is established between the first electronic device 10 and the third electronic device 30.
  • the first electronic device 10 and the third electronic device 30 may be directly connected point-to-point, such as through Bluetooth, or may be connected through a local area network, or may be connected through the Internet, etc., which is not limited here.
  • the first electronic device 10 and the third electronic device 30 belong to the same group, and electronic devices in the same group can share a historical voice command set; electronic devices in different groups cannot share historical voice command sets.
  • the grouping can be set by the user. For example, electronic devices in the same family residence can be divided into the same group, or electronic devices in multiple residences of the same family can be divided into the same group; the embodiments of the present application do not limit this.
  • the first electronic device 10 and the third electronic device 30 may share historical voice commands, so that the historical voice command sets in the first electronic device 10 and the third electronic device 30 are synchronized. For example, after the first electronic device 10 receives and processes a voice command, it can store the processed voice command into its own historical voice command set and send the voice command to the third electronic device 30, so that the third electronic device 30 stores it into the historical voice command set in the third electronic device 30, thereby keeping the two sets synchronized. For another example, the first electronic device 10 and the third electronic device 30 can synchronize their historical voice command sets at a set time or on a set cycle.
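A sketch of the two synchronization variants just described (push on store, and periodic merge); de-duplicating by timestamp and text is an assumption of this example:

```python
# Variant 1: store locally, then push to the other devices in the group.
def store_and_push(command: dict, local_history: list, peer_histories: list) -> None:
    local_history.append(command)
    for peer in peer_histories:            # peer histories stand in for the
        peer.append(command)               # devices reachable over the group link

# Variant 2: merge two sets at a set time or cycle, de-duplicating by
# (timestamp, text) and keeping chronological order.
def periodic_merge(history_a: list, history_b: list) -> list:
    merged = {(c["time"], c["text"]): c for c in history_a + history_b}
    return [merged[key] for key in sorted(merged)]
```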
  • the first electronic device 10 may also establish a communication connection with the second electronic device 20 .
  • the first electronic device 10 and the second electronic device 20 may be directly connected point-to-point, such as through Bluetooth, or may be connected through a local area network, or may be connected through the Internet, etc., which is not limited here.
  • When the voice assistant receives a voice command in the form of audio data, it recognizes the received audio data based on the ASR function to obtain the voice command in text form, and parses the text based on the NLU function. If the voice command does not include a slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set on the electronic device (a set shared by multiple electronic devices) to obtain a historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of the historical voice command, thereby completing the current voice command. Afterwards, the voice assistant can execute the voice command. The way the voice assistant executes voice commands can be found in the relevant description of scenario one.
  • FIG. 4 exemplarily shows a schematic system architecture diagram of scenario four in the embodiment of the present application.
  • In this scenario, the ASR function and NLU function of the voice assistant are located in the first electronic device 10, while the function of replenishing slots for voice commands with missing slots (referred to as the processing function in this embodiment of the application) is located in the cloud, and the historical voice command set also resides in the cloud.
  • the first electronic device 10 has an audio collection device or module (such as a microphone) that can receive the user's voice.
  • the first electronic device 10 may also establish a communication connection with the second electronic device 20 .
  • the second electronic device 20 may not be equipped with a voice assistant and may lack the ability to process voice commands; it therefore needs to connect to another electronic device (such as the first electronic device 10) so that the other device can process voice commands, and the second electronic device 20 receives and responds to the control commands that the other device sends based on those voice commands.
  • the second electronic device 20 may also have voice processing capabilities.
  • When the first electronic device 10 receives a voice command in the form of audio data, it recognizes the received audio data based on the ASR function to obtain the voice command in text form, and parses the text based on the NLU function. If the voice command does not include a slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant sends a request message to the cloud; based on the request message, the cloud obtains a historical voice command related to the current voice command from the historical voice command set stored in the cloud, and sends the relevant data of that historical voice command to the first electronic device 10, so that the voice assistant in the first electronic device 10 can complete the current voice command with the slot of the historical voice command. Afterwards, the voice assistant can execute the voice command. The way the voice assistant executes voice commands can be found in the relevant description of scenario one.
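Illustrative request and response payloads for this device-to-cloud exchange; the field names and values are assumptions, as the patent does not define a message format:

```python
# Request from the first electronic device: the parsed command (slot missing)
# plus the association information collected when the command was received.
request = {
    "type": "query_related_history",
    "command": {"text": "Play movie XXX", "intent": "PLAY_VIDEO", "slots": {}},
    "associated_info": {
        "device": "speaker-livingroom",   # receiving device
        "location": "living room",       # receiving-device location
        "time": 1700000000,              # reception time
        "speaker": "user-a",             # user identity (voice features)
    },
}

# Response from the cloud: the related historical (second) voice command,
# whose slot can complete the current command.
response = {
    "type": "query_related_history_result",
    "second_command": {
        "text": "play movie YYY on TV",  # assumed historical utterance
        "intent": "PLAY_VIDEO",
        "slots": {"device": "TV"},       # slot used to complete the request
    },
}
```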
  • FIG. 5 exemplarily shows a schematic system architecture diagram of scenario five in the embodiment of the present application.
  • the ASR function and NLU function in the voice assistant are located in the cloud.
  • the function of replenishing slots for voice commands with missing slots (referred to as the processing function in this embodiment) is also located in the cloud, and the cloud also stores the historical voice command set.
  • the first electronic device 10 has an audio collection device or module (such as a microphone) that can receive the user's voice.
  • the first electronic device 10 may also establish a communication connection with the second electronic device 20 .
  • When the first electronic device 10 receives a voice command in the form of audio data, it sends the audio data to the cloud; the cloud recognizes the received audio data based on the ASR function to obtain the voice command in text form, and parses the text based on the NLU function. If the voice command does not include a slot, or only uses a demonstrative pronoun to indicate the slot, the cloud obtains a historical voice command related to the current voice command from the historical voice command set stored in the cloud, and sends the relevant data of that historical voice command to the first electronic device 10, so that the voice assistant in the first electronic device 10 can complete the current voice command with the slot of the historical voice command. Afterwards, the voice assistant can execute the voice command. The way the voice assistant executes voice commands can be found in the relevant description of scenario one.
  • the ASR functionality may be located on the electronic device side.
  • FIG. 6 is a schematic diagram of the internal hardware structure of the electronic device 100 provided by the embodiment of the present application.
  • the electronic device 100 may be the electronic device in each of the scenarios shown in FIG. 1 , FIG. 2 and FIG. 3 .
  • the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2 , mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C. Further, it may also include a headphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identification module (subscriber identification module, SIM) card interface 195, etc.
  • Processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (application processor, AP), modem processor, graphics processing unit (GPU), image signal processor (image signal processor, ISP), controller, video Codec, digital signal processor (digital signal processor, DSP), baseband processor, display processing unit (display process unit, DPU), and/or neural network processor (neural-network processing unit, NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • electronic device 100 may also include one or more processors 110 .
  • the processor is the nerve center and command center of the electronic device 100 .
  • the processor can generate operation control signals based on the instruction opcode and timing signals to complete the control of fetching and executing instructions.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • processor 110 may include one or more interfaces.
  • Interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc.
  • the USB interface 130 is an interface that complies with USB standard specifications, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the USB interface 130 can be used to connect a charger to charge the electronic device 100, and can also be used to transmit data between the electronic device 100 and peripheral devices. It can also be used to connect headphones to play audio through them.
  • the sensor module 180 may include one or more of the following: pressure sensor 180A, gyro sensor 180B, air pressure sensor 180C, magnetic sensor 180D, acceleration sensor 180E, distance sensor 180F, proximity light sensor 180G, fingerprint sensor 180H, temperature sensor 180J , touch sensor 180K, ambient light sensor 180L, bone conduction sensor 180M.
  • the charging management module 140 is used to receive charging input from the charger.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the wireless communication function of the electronic device 100 can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • antenna 1 can be multiplexed as a diversity antenna of a wireless local area network. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be disposed in the processor 110 .
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used to modulate the low-frequency baseband signal to be sent into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low-frequency baseband signal.
  • the demodulator then sends the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the application processor outputs sound signals through audio devices (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194.
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent of the processor 110 and may be provided in the same device as the mobile communication module 150 or other functional modules.
  • the wireless communication module 160 can provide wireless communication solutions applied on the electronic device 100, including wireless local area network (WLAN) (such as a Wi-Fi network), Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) and other solutions.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110, frequency modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example, saving music and video files in the external memory card.
  • the electronic device 100 can implement display functions through a GPU, a display screen 194, an application processor AP, and the like.
  • the display screen 194 is used to display images, videos, etc.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • FIG. 7 is a schematic diagram of the software structure of the electronic device in the embodiment of the present application.
  • the electronic device may be an electronic device in various scenarios shown in FIG. 1, FIG. 2, FIG. 3, FIG. 4 or FIG. 5.
  • the software architecture of the electronic device 100 may include an application layer 701, a framework layer 702, native libraries & Android Runtime 703, a hardware abstraction layer (HAL) 704 and a kernel 705.
  • the embodiment of this application is explained by taking the operating system of the electronic device 100 as an Android system as an example.
  • the operating system of the electronic device 100 may also be the Hongmeng system, the iOS system, or another operating system, which is not limited in the embodiments of this application.
  • Application layer 701 may include voice assistant 7011.
  • the application layer 701 may also include other applications.
  • the other applications may include cameras, galleries, calendars, calls, maps, navigation, music, videos, short messages, etc., which is not limited by the embodiments of this application.
  • the voice assistant 7011 can control the electronic device 100 to communicate with the user through voice.
  • the voice assistant 7011 can access the historical voice command set to obtain historical voice commands related to the current voice command according to the historical voice command set, and thereby determine the missing slot in the current voice command based on the slot of the historical voice command.
  • the historical voice command collection may be located locally on the electronic device 100 or may be located in the cloud. If located in the cloud, the voice assistant 7011 can access the historical voice command collection in the cloud through the communication module of the electronic device 100, and can also request the cloud to query historical voice commands related to the current voice command.
  • the framework layer 702 may include an activity manager, a window manager, a content provider, a resource manager, a notification manager, etc. This embodiment of the present application does not impose any limitation on this.
  • Native libraries & Android Runtime 703 include core libraries and a virtual machine.
  • Android runtime is responsible for the scheduling and management of the Android system.
  • the core libraries contain two parts: one part is the function interfaces that the Java language needs to call, and the other part is the core libraries of Android.
  • the application layer and application framework layer run in virtual machines.
  • the virtual machine compiles the Java files of the application layer and the application framework layer into binary files and executes them.
  • the virtual machine is used to perform object life cycle management, stack management, thread management, security and exception management, and garbage collection and other functions.
  • HAL 704 can include modules for the microphone, speaker, Wi-Fi, Bluetooth, camera, sensors, etc.
  • Kernel 705 is the layer between hardware and software.
  • the kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
  • the modules included in each layer shown in Figure 7 are the modules involved in the embodiments of the present application; they are described by way of example and do not limit the structure or the module deployment of the electronic device.
  • the modules shown in Figure 7 can be deployed individually, or several modules can be deployed together.
  • the division of modules in Figure 7 is an example.
  • the names of the modules shown in Figure 7 are examples.
  • Figure 8a exemplarily shows a schematic diagram in which all functions of the voice assistant are deployed in an electronic device.
  • the electronic device has a voice reception function and a voice output function.
  • a voice assistant is installed in the electronic device.
  • the voice assistant may include a management module 801, an ASR module 802, an NLU module 803, a processing module 804, and an execution module 805; furthermore, it may also include a natural language generation (NLG) module 806 and a text-to-speech (TTS) module 807.
  • the management module 801 is used to collect voice signals (which may also be called voice data, audio data of voice instructions, audio signals, etc., referred to as voice for short) through the microphone, and play the audio data that needs to be output through the speaker.
  • the ASR module 802 is used to convert the audio data of the voice command into text data.
  • For example, the voice "Xiaoyi, please play music" input by the user through the microphone is converted into text data that can be recognized by the application (which may be called text information, or text for short).
  • the NLU module 803 is used to parse the text data of the voice command to obtain the user's intention and slot. For example, from the text data "Xiaoyi, please play music", the user's intention is identified as "play music”.
  • the processing module 804 is used to, when a slot is missing in the data structure of the voice command parsed by the NLU module 803, obtain the historical voice command related to the current voice command based on the historical voice command set, and fill the missing slot of the current voice command according to the slot of the historical voice command. Further, the completed voice command can be provided to the execution module 805, or sent to the NLU module 803, which then sends it to the execution module 805. In another possible implementation, the processing module 804 can also send the slot to be supplemented for the first voice command to the NLU module 803, and the NLU module 803 completes the first voice command and sends it to the execution module 805.
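  • As an illustration of this slot-filling step, the following is a minimal sketch in Python; the data shapes, the relevance callback, and all names are illustrative assumptions rather than the patent's implementation:

```python
# Minimal sketch: when the parsed command is missing a required slot,
# copy it from the most relevant historical command.
def fill_missing_slots(current, history, required_slots, relevance):
    missing = [s for s in required_slots if s not in current["slots"]]
    if not missing:
        return current
    # Pick the historical command most relevant to the current one.
    best = max(history, key=lambda h: relevance(current, h), default=None)
    if best is not None:
        for slot in missing:
            if slot in best["slots"]:
                current["slots"][slot] = best["slots"][slot]
    return current

current = {"intent": "TURN_UP_TEMPRATURE", "slots": {}}
history = [{"intent": "TURN_UP_TEMPRATURE",
            "slots": {"location": "living room", "device": "air conditioner"}}]
same_intent = lambda cur, hist: 1.0 if cur["intent"] == hist["intent"] else 0.0
print(fill_missing_slots(current, history, ["device"], same_intent))
```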
  • the execution module 805 is used to execute corresponding tasks on the device based on the parsed user intention and slot, that is, to respond according to the user intention. For example, when the user's intention is to "play music" and the execution device in the slot is a "smartphone", the execution module 805 can call the music player of this device (smartphone) to play music. In other scenarios, if the parsed slot indicates that the executor of the user intention is a cloud service, the execution module 805 can call the corresponding cloud service to execute the user intention.
  • if the execution device is another electronic device, the execution module 805 can generate, based on the user's intention and the slot, control instructions that can be recognized by the other electronic device, and send the control instructions to the other electronic device through the communication module of the electronic device 100.
  • the execution module 805 can also send the execution result of the voice instruction (structured data form) to the NLG module 806.
  • the NLG module 806 is used to generate an execution result in natural language form based on the execution result (structured data) of the voice instruction, such as generating a natural language reply message "Okay, the music has been played" based on the structured data of the reply message.
  • the TTS module 807 is used to convert the text data of the execution result from the NLG module 806 into audio data, and can send the audio data of the execution result to the management module 801.
  • the management module 801 can send the audio data of the execution result to the speaker for playback.
  • the processing module 804 stores the current voice command into the historical voice command set after the current voice command is completed.
  • the processing module 804 can also share or synchronize the historical voice command set in the current device with the historical voice command sets in other electronic devices.
  • FIG 8b is a schematic diagram of the deployment of another voice assistant provided by an embodiment of the present application.
  • the processing module 804 in the voice assistant is deployed in the cloud
  • the historical voice command collection is also deployed in the cloud
  • other functional modules are deployed in the electronic device.
  • FIG. 8c is a schematic diagram of the deployment of another voice assistant provided by an embodiment of the present application.
  • the management module 801, ASR module 802, NLU module 803, and processing module 804 in the voice assistant are deployed in the cloud, and the historical voice command collection is also deployed in the cloud.
  • NLG module 806 and TTS module 807 can also be deployed in the cloud.
  • the electronic device has a voice reception function and a voice output function, and can receive audio data of voice instructions and output the voice of the execution result of the voice instructions.
  • FIG. 9 takes a specific application scenario as an example to show a schematic execution logic diagram of the voice command processing method provided by the embodiment of the present application.
  • when the user uses the voice assistant, the following process takes place: first, the user says "turn it up" to the electronic device equipped with the voice assistant; the voice assistant converts the voice into text, and the NLU module of the voice assistant then performs natural language processing on the text and extracts the user's intention: brighten the device. Because a slot (the execution device) is missing, the voice assistant starts the following dialogue state tracking process:
  • the voice assistant obtains the historical voice command set, matches the current round of dialogue (i.e., the current voice command) against the historical voice commands in the historical voice command set to obtain the historical round of dialogue related to the current round, and adds the execution device information from the relevant historical round of dialogue to the slot of the current round of dialogue.
  • the completed dialogue of this round is:
  • the voice assistant converts the completed voice command of the current round of dialogue into a control command that the execution device "light" can recognize and execute, and sends the control command to the execution device "light" so that the execution device executes the control command.
  • after the voice assistant completes the voice command of this round, it also starts the following dialogue policy process: it stores the completed voice command into the historical voice command set.
  • the collection of historical voice commands can be located locally on the electronic device or in the cloud.
  • after the voice assistant receives the execution success information fed back by the execution device "light", it generates the data structure of the reply voice:
  • the NLG module in the voice assistant generates the natural language "Okay, the lights have been turned on" based on the data structure of the reply voice, converts the natural language into audio data through the TTS module, and plays the audio data.
  • the voice assistant can obtain the historical voice command related to the current voice command based on the historical voice command set, and add the slot of the historical voice command into the current voice command, so that the current voice command is complete.
  • the function of the voice assistant is deployed in the electronic device, that is, this process can be applied to the architecture shown in Figure 8a.
  • the functions of the voice assistant can also be deployed in the cloud, that is, the process can also be applied to the architecture shown in Figure 8c.
  • the process can include the following steps:
  • Step 1 The audio module in the electronic device receives the user's voice, obtains the audio data of the voice command, and sends the audio data of the voice command to the management module in the voice assistant.
  • before using the voice assistant, the user first logs in to the voice assistant using a user account.
  • the audio module (such as a microphone) of the electronic device is activated.
  • the microphone receives the user's voice and obtains the audio data of the voice command.
  • Step 2a After receiving the audio data of the voice command, the management module in the voice assistant sends the audio data of the voice command to the ASR module in the voice assistant.
  • the management module can also cache device information, location information, user account information, time information, user identity information, and other information, for example by caching it to a designated area in the memory, or by notifying relevant functional modules (such as the processing module) of the cache address after caching, so that the processing module can obtain this information, use it as the associated information corresponding to the current voice command, and use it as a basis for obtaining historical voice commands related to the current voice command.
  • the device information is information about the device that received the voice instruction.
  • the device information may include device type, device identification (deviceID), device status, device name, etc.
  • the location information is the location information of the device receiving the current voice command.
  • the user account information is the account information of the user who logs in to the voice assistant.
  • the user account information may include a user identification.
  • the time information is the current time.
  • the user identity information may be feature information (voiceprint) of audio data, which may be obtained by feature extraction of audio data of received voice instructions.
  • the user identity information may also be user identification and other information that can indicate the user's identity (for example, it may be the user's role as a member of the family).
  • after the feature information of the audio data of the voice command is extracted, the user list (which records the correspondence between voiceprints and user identifications) can be queried using the feature information to obtain the corresponding user identification.
  • one or more of the device information, location information, user account information, time information, and user identity information can be used as the associated information corresponding to the current voice command for subsequent voice command processing.
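  • A hedged sketch of collecting this associated information when a command is received follows; the field names, the stub voiceprint extractor, and the user list contents are assumptions for illustration:

```python
import time

# Hypothetical user list: voiceprint feature -> user identification.
USER_LIST = {"vp_0001": "user_mother"}

def extract_voiceprint(audio_data):
    # Stand-in for real feature extraction over the command's audio data.
    return "vp_0001"

def collect_associated_info(device, account_id, audio_data):
    voiceprint = extract_voiceprint(audio_data)
    return {
        "device": device,                    # deviceID, deviceType, status, name
        "location": device.get("location"),  # location of the receiving device
        "account": account_id,               # account logged in to the voice assistant
        "time": time.time(),                 # reception time of the command
        "user": USER_LIST.get(voiceprint),   # voiceprint -> user identification
    }

info = collect_associated_info(
    {"deviceID": "speaker_01", "deviceType": "speaker", "location": "living room"},
    "account_42", b"\x00\x01")
```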
  • Step 3 The ASR module in the voice assistant recognizes the audio data of the voice command and obtains the text data of the voice command.
  • the ASR module can use the ASR algorithm to identify the audio data of the voice command.
  • the embodiment of the present application does not limit the ASR algorithm.
  • Step 4 The ASR module in the voice assistant sends the text data of the voice command to the NLU module.
  • Step 5 The NLU module in the voice assistant parses the text data of the voice command to obtain the user's intention.
  • if both the user intention and the slot are parsed, step 10 is performed to send the structured data of the voice command (including the user intention and slot) to the execution module; if the user intention is parsed but the slot is not, or the slot is only referred to by a demonstrative pronoun, the semantics of the voice command are incomplete, and in this case the subsequent steps 6 to 9 are performed.
  • the NLU module can use the NLU algorithm to parse the text data of the voice command.
  • the embodiment of this application does not limit the NLU algorithm.
  • Step 6 The NLU module in the voice assistant sends the first instruction information to the processing module to instruct the processing module to obtain the missing slot in the voice command.
  • the first indication information may include user intent parsed by the NLU module.
  • Step 7 The processing module in the voice assistant obtains relevant historical voice commands by querying the historical voice command collection according to the first instruction information, and further obtains the slot of the historical voice command.
  • the processing module can obtain the previously cached associated information corresponding to the current voice command, and use the associated information as one of the bases for obtaining the matching historical voice command.
  • the processing module can further store the voice command, together with the associated information corresponding to the voice command, into the historical voice command set, so that voice commands with missing semantics can be processed in subsequent rounds of dialogue based on this historical voice command set.
  • the historical voice instructions in the historical voice instruction set can be structured data.
  • in this case, the structured data of the first voice command can be used to match against the structured data of the historical voice commands.
  • the structured data of a historical voice command includes the user intention and the slot, where the slot may be described using natural language, such as "air conditioner".
  • the structured data can be parsed by the NLU module in the voice assistant. In the case where part of the structured data parsed by the NLU module is missing (such as a missing slot), the missing part (such as a slot) can be completed and the completed structured data can be stored in the historical voice command collection.
  • An example of the structured data of a historical voice command is:
  • Intention: turn up the temperature (TURN_UP_TEMPRATURE)
  • Slot 1 (location): living room
  • Slot 2 (execution device): air conditioner.
  • the slots in the structured data can also be the underlying protocol parameters obtained through the underlying mapping.
  • the above-mentioned "living room" and "air conditioner" are mapped to a device identifier (deviceID) used to uniquely identify the electronic device (the air conditioner).
  • in this way, the slots obtained based on the historical voice command set are already the underlying protocol parameters; no parsing or feature extraction is required, and the underlying protocol parameters are directly added to the voice command with incomplete semantics (missing slots).
  • Another example of the structured data of a historical voice command uses underlying protocol parameters, in which deviceID_air-conditioning_01 represents the device ID of the air conditioner located in the living room.
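  • To make the two slot representations concrete, here is a small sketch; the field names and the deviceID mapping table are illustrative assumptions, not a real IoT protocol:

```python
# The same historical command's structured data in its two forms:
# natural-language slots, and slots mapped to underlying protocol parameters.
NL_FORM = {
    "intent": "TURN_UP_TEMPRATURE",
    "slots": {"location": "living room", "device": "air conditioner"},
}

# Hypothetical mapping from (location, device type) to a unique deviceID.
DEVICE_TABLE = {("living room", "air conditioner"): "deviceID_air-conditioning_01"}

def to_protocol_form(cmd):
    key = (cmd["slots"]["location"], cmd["slots"]["device"])
    return {"intent": cmd["intent"], "slots": {"deviceID": DEVICE_TABLE[key]}}

print(to_protocol_form(NL_FORM))
# {'intent': 'TURN_UP_TEMPRATURE', 'slots': {'deviceID': 'deviceID_air-conditioning_01'}}
```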
  • the historical voice instructions in the historical voice instruction set can be text data.
  • in this case, the text data of the first voice command can be used to match against the text data of the historical voice commands.
  • the text data of historical voice commands may be recognized by the ASR module in the voice assistant based on the audio data of the voice commands. It may also be obtained by supplementing the semantics of the voice command when the semantics of the voice command recognized based on the ASR module are incomplete.
  • An example of text data of historical voice commands is "Turn up the temperature of the air conditioner in the living room.”
  • the historical voice command set may also include associated information corresponding to the historical voice command.
  • the associated information corresponding to a historical voice command may include information in one or more dimensions.
  • the associated information corresponding to a historical voice command may include at least one of the following information: time information, device information of the receiving device, location information of the receiving device, user account information, user identity information, etc. The associated information corresponding to a historical voice command is collected when the historical voice command is received.
  • the time information in the associated information corresponding to a historical voice command can be the time when the historical voice command was received, or the time when the voice command was processed (such as the time after ASR processing of the voice command, or the time after NLU processing of the voice command).
  • the device information in the associated information is the device information of the receiving device of the historical voice command.
  • the device information may include device type, device identification (deviceID), device status, device name, etc.
  • the user account information in the associated information is used to indicate the user account logged in to the voice assistant when the historical voice command was received and processed; the user identity information in the associated information is used to indicate the user who issued the historical voice command.
  • for the explanation and description of the associated information corresponding to a voice command, refer to the relevant description in step 2a.
  • Table 1 Collection of historical voice commands.
  • the associated information corresponding to a historical voice command includes: user account information, user identity information, receiving device, and time.
  • the associated information corresponding to a voice command can also include the time interval between the voice command and the previously received voice command, the conversation location, semantic correlation, and other information; furthermore, it can also include the user's physiological characteristics, the peripheral device list of the device receiving the current voice command, the number of dialogue rounds, and other information, which will not be listed one by one here.
  • the voice assistant can collect the above information to obtain the associated information corresponding to the voice command.
  • the associated information can be stored together with the voice command in a historical voice command collection.
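  • A minimal sketch of storing a completed command together with its associated information into the history set is shown below; an in-memory list stands in for the local or cloud storage, and all names are illustrative:

```python
import time

HISTORY = []  # stand-in for the historical voice command set

def store_history(structured_cmd, associated_info):
    # Store the command and the associated information collected when it was
    # received, so later rounds can match against both.
    HISTORY.append({"command": structured_cmd,
                    "associated": associated_info,
                    "stored_at": time.time()})

store_history(
    {"intent": "TURN_UP_TEMPRATURE",
     "slots": {"location": "living room", "device": "air conditioner"}},
    {"account": "account_42", "user": "user_mother", "device": "speaker_01"})
```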
  • if the voice assistant parses the voice command and obtains the type of the execution device, and finds that there are multiple devices of this type, the voice assistant uses a voice prompt to ask the user to clarify or select among the multiple devices; after the voice assistant obtains the execution device selected by the user, on the one hand it generates a control instruction based on the voice command and sends the control instruction to the execution device for execution, and on the other hand it updates the structured data of the voice command according to the execution device selected by the user, such as updating the execution device information of the voice command, and stores the updated structured data of the voice command into the historical voice command set.
  • FIG. 11 shows a schematic diagram of a clarification execution device through multiple rounds of dialogue.
  • after the voice assistant receives the user's voice saying "turn up the air conditioner temperature", it recognizes the voice as text, parses the text, and obtains structured data 1 of the voice command:
  • the voice assistant determines that the type of the execution device is an air conditioner, queries the registered devices of this type, and finds that there are multiple candidate execution devices such as "air conditioner in the bedroom" and "air conditioner in the living room"; it then outputs a prompt voice to the user: "You have multiple air conditioners; which one should be adjusted: 1. The air conditioner in the bedroom, 2. The air conditioner in the living room, 3.
  • after the voice assistant receives the user's voice "living room" used to select the execution device, it converts the voice into text, parses the text, updates the above structured data 1 of the voice command according to the parsed execution device, and obtains the updated structured data 2 or structured data 3 of this voice command:
  • deviceID_air-conditioning_001 in structured data 3 represents the device identification of the air conditioner located in the living room, which is obtained by mapping "air conditioner” and "living room” described in natural language.
  • the above-mentioned structured data 2 or structured data 3 of the voice command will be stored in the historical voice command collection.
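  • A hedged sketch of this clarification flow follows; the device registry contents and the ask_user callback are assumptions for illustration:

```python
# When several registered devices match the parsed device type, ask the user
# to choose, then update the command's structured data accordingly.
REGISTRY = {"air conditioner": ["bedroom", "living room"]}

def clarify_execution_device(cmd, ask_user):
    rooms = REGISTRY.get(cmd["slots"]["device"], [])
    if len(rooms) > 1:
        choice = ask_user([f"the air conditioner in the {r}" for r in rooms])
        cmd["slots"]["location"] = rooms[choice]  # structured data 1 -> 2
    return cmd

cmd = {"intent": "TURN_UP_TEMPRATURE", "slots": {"device": "air conditioner"}}
print(clarify_execution_device(cmd, lambda options: 1))  # user picks "living room"
```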
  • when the voice assistant parses the voice command and the execution device is missing from the voice command, the voice assistant can select the default execution device as the execution device for the voice command. If the voice assistant determines, based on the execution result, that the execution of the voice command failed, and determines that the cause of the failure is an execution device error, the voice assistant can initiate a dialogue with the user and ask the user for the execution device. After the voice assistant obtains the execution device, updates the structured data of the voice command, sends the corresponding control command to the execution device, and receives a response indicating successful execution, the voice assistant can store the updated structured data of the voice command into the historical voice command set.
  • historical voice instructions whose time interval from the current time exceeds a set threshold can be cleared from the historical voice instruction set.
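  • A sketch of that aging policy, assuming records carry a stored_at timestamp and using a 24-hour value where the text only says "a set threshold":

```python
import time

MAX_AGE_SECONDS = 24 * 3600  # assumed value for the set threshold

def prune_history(history, now=None):
    # Drop historical commands whose age exceeds the configured threshold.
    now = time.time() if now is None else now
    return [h for h in history if now - h["stored_at"] <= MAX_AGE_SECONDS]
```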
  • the historical voice command set is stored locally on the electronic device, and the voice commands in the historical voice command set only include voice commands received and processed by the device.
  • in another implementation, the historical voice command set is stored locally on the electronic device, and the voice commands in the historical voice command set include voice commands received and processed by the device itself, and may also include voice commands received and processed by other electronic devices.
  • communication connections can be established between multiple electronic devices to synchronize historical voice commands with each other. These electronic devices can synchronize historical voice commands according to a set cycle or at a set time; alternatively, when one of the electronic devices receives and processes a voice command, that electronic device takes the voice command as a historical voice command and synchronizes it to the other electronic devices, as in the sketch below.
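  • An illustrative sketch of the push-on-process variant; the group membership and the transport are assumptions (a real system would send over the established Bluetooth, LAN, or Internet connection):

```python
GROUP = {"speaker_01": [], "phone_01": [], "tv_01": []}  # device -> local history

def process_and_sync(device_id, command):
    # The device that processed the command stores it locally, then pushes it
    # to every other member device in its pre-configured group.
    GROUP[device_id].append(command)
    for member, history in GROUP.items():
        if member != device_id:
            history.append(command)  # stand-in for sending over the connection

process_and_sync("phone_01",
                 {"intent": "PLAY_MOVIE", "slots": {"device": "living room TV"}})
```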
  • these electronic devices that synchronize with each other or share historical voice commands can be pre-configured as a group, and the group information can be configured on these electronic devices respectively.
  • the group information can include the group identifier and information about each member device in the group, such as the device identification or address.
  • Member devices in a group can establish connections between each other based on group information, or can establish communication connections based on user operations, such as establishing a Bluetooth connection in response to the user's Bluetooth pairing operation.
  • all or part of the electronic devices in an area can be divided into a group.
  • Electronic devices within a group may include different types of electronic devices, such as smart speakers, smart TVs, smart phones, etc.
  • member devices within a group may be electronic devices associated with the same user account information.
  • the electronic devices associated with the same user account can be divided into a group.
  • the historical voice command collection is stored on the network side, such as in the cloud.
  • the storage address of the historical voice command collection in the cloud can be configured in the electronic device, so that the electronic device can store the historical voice command in the historical voice command collection in the cloud.
  • the electronic device may query the historical voice command collection in the following ways:
  • Method 1 When the electronic device has a historical voice command set stored locally, the locally stored historical voice command set can be directly queried;
  • Method 2 When the historical voice command set is stored in the cloud, the electronic device can interact with the cloud to query the historical voice command set, such as sending a query request.
  • the query request can carry query conditions or keywords, such as the user intention, time range, user ID, etc., so as to obtain the historical voice records that meet the query conditions.
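  • A hedged sketch of such a query against a cloud-hosted history set; the endpoint URL, payload fields, and response shape are assumptions, not a defined interface:

```python
import json
import urllib.request

def query_history(intent, user_id, time_range):
    # Send a query request carrying query conditions to the (hypothetical)
    # cloud endpoint and return the historical voice records that match.
    payload = json.dumps({"intent": intent, "user_id": user_id,
                          "time_range": list(time_range)}).encode()
    req = urllib.request.Request("https://cloud.example.com/history/query",
                                 data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```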
  • the process of the processing module obtaining the historical voice command related to the current voice command from the historical voice command set may include: determining the correlation between each historical voice command in the historical voice command set and the current voice command (i.e., the currently received voice command with the missing slot, that is, the "first voice command" in the embodiments of the present application), and then, based on the correlation between the current voice command and each historical voice command, selecting from the historical voice command set the historical voice command related to the current voice command, for example, selecting the most relevant historical voice command.
  • the processing module can calculate the correlation based on the following information:
  • a correlation score can be used to characterize the degree of correlation.
  • the calculation process of the relevance score may include: first, after the electronic device receives the current voice command, obtaining the associated information corresponding to the voice command; then, through preset rules, matching the associated information corresponding to the current voice command against the associated information corresponding to each historical voice command to obtain the correlation score between the current voice command and each historical voice command.
  • the associated information may include information in one or more dimensions.
  • the associated information may include one or more of the following: information about the electronic device (for example, including the device name, device type, device status, etc.), user account information (that is, the user account information for logging in to the voice assistant), user identity information (the user identity information is associated with the feature information of the audio data of the voice command), the reception time of the voice command, the time interval between the voice command and the previously received voice command, the conversation location, semantic relevance, the user's physiological characteristics, the peripheral device list of the device receiving the current voice command, the number of dialogue rounds, and other information.
  • rule-based methods can be used to calculate the matching degree between the current voice command and each historical voice command in each of these dimensions, and the correlation score is then calculated comprehensively based on the matching degree in each dimension.
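  • One way such a rule-based score could look is sketched below; the dimensions, weights, and time scale are illustrative assumptions:

```python
# Per-dimension match scores combined with weights into one correlation score.
WEIGHTS = {"account": 0.3, "user": 0.3, "location": 0.2, "time": 0.2}

def dimension_match(cur, hist, dim):
    if dim == "time":  # closer in time -> higher score (10-minute scale, assumed)
        return max(0.0, 1.0 - abs(cur["time"] - hist["time"]) / 600.0)
    return 1.0 if cur.get(dim) == hist.get(dim) else 0.0

def correlation_score(cur_info, hist_info):
    return sum(w * dimension_match(cur_info, hist_info, d)
               for d, w in WEIGHTS.items())

print(correlation_score(
    {"account": "a42", "user": "u1", "location": "living room", "time": 1000.0},
    {"account": "a42", "user": "u1", "location": "living room", "time": 700.0}))
```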
  • in another implementation, the calculation process of the relevance score may include: first, encoding the currently received voice command and the associated information corresponding to the voice command; for example, the historical dialogue information in the dialogue scenes on each device (including input text, device status, etc.) is uniformly encoded using a natural language encoder, so that the associated information of the historical rounds and the current round is encoded; then, the encoding results are input into a deep learning neural network for inference to obtain the correlation score between the current voice command and each historical voice command.
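  • The learned variant is sketched below with a toy bag-of-words encoder and cosine similarity standing in for the trained natural language encoder and the neural-network inference; everything here is an assumption for illustration:

```python
import math

VOCAB = ["turn", "up", "air", "conditioner", "living", "room", "movie", "tv"]

def encode(text):
    # Toy stand-in for the natural language encoder.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

def score(current, historical):
    # Cosine similarity as a stand-in for neural-network inference.
    a, b = encode(current), encode(historical)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

print(score("turn up", "turn up the air conditioner in the living room"))
```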
  • optionally, the voice assistant can initiate a conversation with the user to obtain the missing part of the voice command. For example, the voice assistant generates prompt information for asking about the slot, or for guiding the user to give the slot or the complete voice command, converts the prompt information into audio data, and outputs it through the audio module to guide the user to give the complete voice command.
  • Step 8 The processing module in the voice assistant sends the obtained slot to the NLU module, or sends the supplemented voice command to the NLU module.
  • the NLU module parses out the slot in the historical voice command and adds the slot to the text data of the currently received voice command.
  • if the historical voice command is structured data, the NLU module can directly obtain the slot from the structured data and add the slot to the structured data of the currently received voice command.
  • if the slot in the structured data is already a mapped underlying protocol parameter, the NLU module can directly add the mapped underlying protocol parameter to the structured data of the currently received voice command; in this case, the slot no longer needs to be mapped in subsequent steps, saving processing overhead.
  • after the NLU module adds the slot to the structured data of the currently received voice command, the demonstrative pronouns in the voice command can be deleted.
  • Step 9 The NLU module in the voice assistant sends the structured data of the voice command (including user intention and slot) to the execution module.
  • the NLU module can map the slot to an underlying protocol parameter, such as a device identifier, and send the mapped slot parameter to the execution module.
  • Step 11 After receiving the user intention and slot of the voice command, the execution module in the voice assistant performs related processing operations for executing the voice command.
  • the execution module can call the corresponding application or service or the corresponding function on the device to execute the user intention, or generate a control instruction that the execution device can recognize and execute based on the voice instruction (including user intention and slot), and Send the control instruction to the execution device.
  • control instructions may include general command information, such as command type, command parameters, etc., and also include device information of the execution device.
  • if the slot has already been mapped to an underlying protocol parameter, the execution module does not need to perform mapping; otherwise, the execution module needs to map the slot described in natural language to the underlying protocol parameter.
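  • A hedged sketch of assembling such a control instruction; the field names follow no particular IoT protocol and the mapping table is assumed:

```python
# General command information plus the device information of the executor.
def build_control_instruction(intent, slots, device_table):
    device_id = slots.get("deviceID") or device_table[
        (slots["location"], slots["device"])]     # map the NL slot if needed
    params = {k: v for k, v in slots.items()
              if k not in ("location", "device", "deviceID")}
    return {"command_type": intent,               # general command information
            "command_params": params,
            "deviceID": device_id}                # device information

table = {("living room", "air conditioner"): "deviceID_air-conditioning_01"}
print(build_control_instruction("TURN_UP_TEMPRATURE",
                                {"location": "living room",
                                 "device": "air conditioner"}, table))
```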
  • Step 12a and step 12b The execution module in the voice assistant sends the execution result of the voice instruction to the processing module and NLG module.
  • the execution module can directly obtain the execution result of the voice command; if the voice command is executed by other electronic devices, the execution module can receive the voice command from the execution device of the voice command. The execution result of the instruction.
  • Step 13 The processing module in the voice assistant determines that the voice command is successfully executed based on the execution result, and then stores the voice command, together with the associated information corresponding to the voice command, into the historical voice command collection.
  • the processing module can store the voice commands in the historical voice command set according to the format requirements.
  • the data requirements for the historical voice commands in the historical voice command set can be found in the previous description.
  • Step 14 The NLG module in the voice assistant converts the execution results into natural language and obtains the text data of the execution results.
  • Step 15 The NLG module in the voice assistant sends the text data of the execution result to the TTS module.
  • Step 16 The TTS module in the voice assistant converts the text data of the execution result into audio data.
  • Step 17 The TTS module in the voice assistant sends the audio data of the execution result to the management module.
  • Step 18 The management module in the voice assistant sends the audio data of the execution result to the audio module (such as a speaker), so that the audio module outputs the corresponding voice to notify the user of the execution result of the voice command.
  • the processing module can also send the complete voice command to the execution module without sending it to the NLU module.
  • in the deployment where the functions of the voice assistant are in the cloud, after receiving the audio data of the voice command, the electronic device sends the audio data to the cloud, and the voice assistant in the cloud processes it accordingly.
  • the electronic device can also send corresponding associated information (such as device information, location information, etc. of the electronic device) to the cloud. This information can be cached in the cloud so that the processing module can perform corresponding processing operations based on this information.
  • the management module can obtain the corresponding user identity information based on the audio data, and can also obtain the user account information currently logged in to the voice assistant, and cache this information so that the processing module can perform corresponding processing operations based on it.
  • the voice assistant in the cloud may or may not include an execution module. If the voice assistant in the cloud includes an execution module, optionally, the execution module can convert the voice command into a control instruction that the execution device can recognize, and send the control instruction to the electronic device (i.e., the receiving device of the voice command), so that the control instruction is executed by the electronic device (if the electronic device is the execution device of the voice command) or is forwarded by the electronic device to the execution device (if the electronic device is not the execution device of the voice command). Of course, the cloud can also send the control instruction directly to the execution device, so that remote control can be achieved. If the voice assistant in the cloud does not include an execution module, optionally, the cloud can send the complete voice command to the electronic device (i.e., the receiving device of the voice command), and the execution module in the receiving device processes it accordingly.
  • step 2a can also be executed after step 2b.
  • steps 12b to 18 may not be executed.
  • with the above method, the missing part of the current voice command can be determined based on the historical voice command set, so that the current voice command is supplemented into a semantically complete voice command; on the one hand, this ensures that the voice command can be executed, and on the other hand, it improves the user experience.
  • in one deployment, the voice assistant in the electronic device can request the cloud to obtain, based on the historical voice command set, the historical voice command related to the current voice command; after obtaining the relevant historical voice command, the voice assistant adds the slot of the historical voice command to the current voice command to make the current voice command complete.
  • in another deployment, the voice assistant in the cloud can obtain, based on the historical voice command set, the historical voice command related to the current voice command; after obtaining the relevant historical voice command, it adds the slot of the historical voice command to the current voice command to make the current voice command complete.
  • the missing part of the current voice command can thus be determined based on the historical voice command set, supplementing the current voice command into a voice command with complete semantics; on the one hand, this ensures that the voice command can be executed, and on the other hand, it improves the user experience. Since the historical voice record set is located in the cloud and multiple electronic devices can share it, cross-device voice continuation can be achieved. Moreover, having the cloud perform the voice continuation (that is, query the relevant historical voice records to obtain the missing part of the current voice command) can reduce the processing overhead on the terminal side.
  • Figure 10, Figure 12, and Figure 13 only describe the voice command processing procedures under some possible deployment methods of the voice assistant; when the functions of the voice assistant are deployed in other ways, the voice command processing procedures are adjusted accordingly and will not be listed one by one here.
  • Example 1 The scenario in Example 1 is: the user first says to the mobile phone, "Play a movie on the TV in the living room”, and then says to the speaker, "Only action movies.”
  • after the mobile phone receives the voice command "Play a movie on the living room TV", it sends a control command to the living room TV so that the living room TV responds to the control command and displays the main movie interface; after the speaker receives the voice command "Only action movies", it sends a control command to the living room TV so that the living room TV responds to the control command and selects an action movie on the main movie interface for playback.
  • Figure 14a shows a system architecture for the above scenario.
  • the mobile phone and the speaker share a historical voice command set.
  • the mobile phone and the speaker locally store a historical voice command set respectively.
  • the mobile phone and the speaker can interact with each other to synchronize their historical voice command sets.
  • through correlation matching, the voice assistant in the speaker can determine that the voice command "Play a movie on the living room TV" is related to the current voice command "Only action movies", and can then determine that the execution device of the current voice command "Only action movies" is the living room TV.
  • Figure 14b shows another system architecture for the above scenario.
  • the mobile phone and the speaker do not store the historical voice command set locally.
  • the historical voice command set is stored in the cloud. Both the mobile phone and the speaker can access the historical voice command set.
  • through correlation matching, the voice assistant in the speaker can determine that the voice command "Play a movie on the living room TV" is related to the current voice command "Only action movies", and can then determine that the execution device of the current voice command "Only action movies" is the living room TV.
  • in Example 1, for the specific way in which the mobile phone and the speaker process the voice commands, refer to the foregoing embodiments.
  • Example 2 The scenario in Example 2 is: the user first says to the TV, "Play Andy Lau's movies", and then says to the tablet, "Only action movies." After the TV receives the voice command "Play Andy Lau's movies", since the voice command does not contain execution device information, the voice assistant in the TV can select the default execution device (for example, this TV is the default execution device) to respond to the voice command and display a list of movies starring Andy Lau; after receiving the voice command "Only action movies", the tablet computer sends a control command to the TV, causing the TV to respond to the control command and select an action movie in the movie list for playback.
  • Figure 15a shows a system architecture for the above scenario, where the TV and the tablet share a collection of historical voice commands.
  • through correlation matching, the voice assistant in the tablet computer can determine that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and can then determine that the execution device of the current voice command "only action movies" is the same as the execution device of the voice command "play Andy Lau's movies".
  • FIG 15b shows another system architecture for the above scenario.
  • the TV and tablet do not store the historical voice command set locally, but the historical voice command set is stored in the cloud. Both the TV and the tablet can access the historical voice command set.
  • through correlation matching, the voice assistant in the tablet computer can determine that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and can then determine that the execution device of the current voice command "only action movies" is the same as the execution device of the voice command "play Andy Lau's movies".
  • Example 3 The scenario in Example 3 is: the user first says to the speaker "Play Andy Lau's movies", and then says to the tablet "Only action movies".
  • after the speaker receives the voice command "Play Andy Lau's movies", the voice assistant can select the default execution device (for example, the living room TV is the default execution device for video playback operations) to respond to the voice command, and the living room TV displays a list of movies starring Andy Lau; after receiving the voice command "Only action movies", the tablet computer sends a control command to the living room TV, so that the living room TV responds to the control command and selects an action movie in the movie list for playback.
  • the speaker and tablet share a collection of historical voice commands.
  • through correlation matching, the voice assistant in the tablet computer can determine that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and can then determine that the execution device of the current voice command "only action movies" is the same as the execution device of the voice command "play Andy Lau's movies".
  • the speaker and tablet computer do not store the historical voice command set locally, but the historical voice command set is stored in the cloud, and both the speaker and the tablet computer can access the historical voice command set.
  • through correlation matching, the voice assistant in the tablet computer can determine that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and can then determine that the execution device of the current voice command "only action movies" is the same as the execution device of the voice command "play Andy Lau's movies".
  • Embodiments of the present application also provide a computer-readable storage medium.
  • the computer-readable storage medium is used to store a computer program.
  • when the computer program is executed by a computer, the computer can implement the method provided by the above method embodiments.
  • Embodiments of the present application also provide a computer program product.
  • the computer program product is used to store a computer program.
  • when the computer program is executed by a computer, the computer can implement the method provided by the above method embodiments.
  • An embodiment of the present application also provides a chip, which includes a processor.
  • the processor is coupled to a memory and is configured to call a program in the memory so that the chip implements the method provided by the above method embodiment.
  • embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A voice command processing method, comprising: a voice assistant obtains a first voice command; determines the intention of the first voice command, and determines, according to the intention of the first voice command, that the first voice command is missing a slot; obtains a second voice command from a historical voice command set, the second voice command being related to the first voice command; and determines the slot of the first voice command according to the slot of the second voice command. With this method, the complete semantics of a voice command can be determined when the semantics of the voice command are incomplete, so that the voice command can be executed. A voice command processing system, an electronic device, and a computer-readable storage medium are also provided.

Description

Voice command processing method, apparatus, system, and storage medium
This application claims priority to the Chinese patent application with application number 202210775676.2, entitled "Voice command processing method, apparatus, system, and storage medium", filed with the China National Intellectual Property Administration on July 1, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of voice control technology, and in particular, to a voice command processing method, apparatus, system, and storage medium.
Background
With the rapid development of information technology, voice control, as a form of human-computer interaction, is being applied more and more widely.
At present, voice control is usually implemented through a voice assistant: the user can input voice commands through the voice assistant, and the voice assistant then controls an electronic device, according to the voice command input by the user, to perform the operation corresponding to the voice command. Especially in the smart home field, the voice assistant can serve as the control portal of the smart home, directly and automatically controlling smart devices through voice dialogue and making it convenient for users to use various devices.
Current voice assistants lack the ability to continue across multiple rounds when processing voice commands, and require every voice command input by the user to provide complete information, for example including the intention and the slot. If the semantics of a voice command input by the user are incomplete, for example the execution device is missing, the voice assistant cannot understand the voice command, and the voice command cannot be executed.
Summary
Embodiments of the present application provide a voice command processing method, apparatus, system, and storage medium, which are used to determine the complete semantics of a voice command when the semantics of the voice command are incomplete, so that the voice command can be executed.
In a first aspect, a voice command processing method is provided, and the method may be executed by a voice assistant. The method includes: obtaining a first voice command, determining the intention of the first voice command, and determining, according to the intention of the first voice command, that the first voice command is missing a slot; obtaining a second voice command from a historical voice command set, the second voice command being related to the first voice command; and determining the slot of the first voice command according to the slot of the second voice command.
In the above implementation, when the first voice command is missing a slot (for example, the executor of the intention; more specifically, for example, an execution device, application, or service), the slot of the second voice command in the historical voice command set that is related to the first voice command can be used to fill the slot of the first voice command, so as to obtain a semantically complete voice command, which in turn enables the voice command to be executed.
In a possible implementation, determining the slot of the first voice command according to the slot of the second voice command includes: the slot missing from the first voice command is provided by the corresponding slot of the second voice command.
In a possible implementation, after the slot of the first voice command is determined according to the slot of the second voice command, the method further includes: adding the slot of the second voice command to the first voice command, and storing the first voice command with the added slot into the historical voice command set.
In the above implementation, storing the completed first voice command into the historical voice command set can provide a basis for subsequent voice processing operations. Optionally, the first voice command may be stored into the historical voice command set after the first voice command is executed successfully.
Optionally, adding the slot of the second voice command to the first voice command includes: obtaining the slot in the structured data of the second voice command, where the slot in the structured data of the second voice command is slot information expressed in natural language or is a protocol parameter, the protocol parameter being obtained by mapping the slot information expressed in natural language; and adding the slot in the structured data of the second voice command to the structured data of the first voice command.
Optionally, the method further includes: if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, deleting the demonstrative pronoun from the first voice command.
After the first voice command is completed, deleting the demonstrative pronoun used to indicate the slot can make the semantics of the voice command clearer.
In a possible implementation, obtaining the second voice command from the historical voice command set includes: obtaining, according to the correlation between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
Optionally, obtaining, according to the correlation between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set includes: determining the correlation between the first voice command and each historical voice command in the historical voice command set according to the first voice command, the intention of the first voice command and/or the associated information corresponding to the first voice command, as well as each historical voice command in the historical voice command set, the intention of each historical voice command and/or the corresponding associated information, where the associated information corresponding to the first voice command is collected when the first voice command is received, and the associated information corresponding to a historical voice command is collected when that historical voice command is received; and selecting, according to the correlation between the first voice command and each historical voice command in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
In the above implementation, performing the voice command matching operation according to the associated information corresponding to the voice commands can make the matching result (that is, the matched second voice command) more accurate.
In a possible implementation, obtaining the second voice command from the historical voice command set includes: a first electronic device sends a first request message to the cloud or to a third electronic device, the first request message being used to request the voice command associated with the first voice command in the historical voice command set, where the first electronic device is the receiving device of the first voice command; and the first electronic device receives a first response message sent by the cloud or the third electronic device, the first response message carrying the second voice command, the second voice command being obtained from the historical voice command set according to the correlation between the first voice command and the historical voice commands in the historical voice command set.
Optionally, the first request message carries the first voice command, the intention of the first voice command and/or the associated information corresponding to the first voice command.
Optionally, the associated information corresponding to the first voice command includes at least one of the following:
device information, where the device information is information about the receiving device of the first voice command;
user account information, where the user account information is the user account information for logging in to the voice assistant;
location information, where the location information is the location information of the receiving device of the first voice command;
time information, where the time information includes the reception time of the first voice command and/or the time interval between the first voice command and the previously received voice command;
user identity information, where the user identity information is associated with the feature information of the audio data of the first voice command.
In the above implementation, the associated information corresponding to a voice command can include information in multiple different dimensions, thereby improving the accuracy of matching.
In a possible implementation, obtaining the second voice command from the historical voice command set, and determining the slot of the first voice command according to the slot of the second voice command, include: the cloud obtains, according to the correlation between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set; and determines the slot of the first voice command according to the slot of the second voice command, where the slot missing from the first voice command is provided by the corresponding slot of the second voice command.
In a possible implementation, obtaining the first voice command includes: the cloud converts the audio data of the first voice command from the first electronic device to obtain corresponding text data; determining the intention of the first voice command, and determining, according to the intention of the first voice command, that the first voice command is missing a slot, includes: the cloud parses the text data to obtain the intention of the first voice command, and determines, according to the intention of the first voice command, that the first voice command is missing a slot; obtaining the second voice command from the historical voice command set, and determining the slot of the first voice command according to the slot of the second voice command, include: the cloud obtains the second voice command from the historical voice command set, and determines the slot of the first voice command according to the slot of the second voice command.
In a possible implementation, the historical voice command set includes structured data of historical voice commands, and the structured data of a historical voice command includes an intention and a slot.
In a possible implementation, the slot is the device, application, or service that executes the intention of the voice command.
In a second aspect, a voice command processing system is provided, including:
an automatic speech recognition module, configured to convert the audio data of a first voice command into text data;
a natural language understanding module, configured to parse the text data of the first voice command to obtain the intention of the first voice command;
a processing module, configured to, if it is determined according to the intention of the first voice command that the first voice command is missing a slot, obtain a second voice command from a historical voice command set, and determine the slot of the first voice command according to the slot of the second voice command, where the second voice command is related to the first voice command.
In a possible implementation, the slot missing from the first voice command is provided by the corresponding slot of the second voice command.
In a possible implementation, the processing module is further configured to: after determining the slot of the first voice command according to the slot of the second voice command, add the slot of the second voice command to the first voice command, and store the first voice command with the added slot into the historical voice command set.
Optionally, the processing module is specifically configured to: obtain the slot in the structured data of the second voice command, where the slot in the structured data of the second voice command is slot information expressed in natural language or is a protocol parameter, the protocol parameter being obtained by mapping the slot information expressed in natural language; and add the slot in the structured data of the second voice command to the structured data of the first voice command.
Optionally, the processing module is further configured to: if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, delete the demonstrative pronoun from the first voice command.
In a possible implementation, the processing module is specifically configured to: obtain, according to the correlation between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
Optionally, the processing module is specifically configured to: determine the correlation between the first voice command and each historical voice command in the historical voice command set according to the first voice command, the intention of the first voice command and/or the associated information corresponding to the first voice command, as well as each historical voice command in the historical voice command set, the intention of each historical voice command and/or the corresponding associated information, where the associated information corresponding to the first voice command is collected when the first voice command is received, and the associated information corresponding to a historical voice command is collected when that historical voice command is received; and select, according to the correlation between the first voice command and each historical voice command in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
Optionally, the associated information corresponding to the first voice command includes at least one of the following:
device information, where the device information is information about the receiving device of the first voice command;
user account information, where the user account information is the user account information for logging in to the voice assistant;
location information, where the location information is the location information of the receiving device of the first voice command;
time information, where the time information includes the reception time of the first voice command and/or the time interval between the first voice command and the previously received voice command;
user identity information, where the user identity information is associated with the feature information of the audio data of the first voice command.
In a possible implementation, the historical voice command set includes structured data of historical voice commands, and the structured data of a historical voice command includes an intention and a slot.
In a possible implementation, the slot is the device, application, or service that executes the intention of the voice command.
In a possible implementation, the automatic speech recognition module, the natural language understanding module, and the processing module are located in a first electronic device; or, the automatic speech recognition module and the natural language understanding module are located in the first electronic device, and the processing module is located in the cloud or in a third electronic device; or, the automatic speech recognition module is located in the first electronic device, and the natural language understanding module and the processing module are located in the cloud; or, the automatic speech recognition module, the natural language understanding module, and the processing module are located in the cloud.
In a possible implementation, if the automatic speech recognition module and the natural language understanding module are located in the first electronic device and the processing module is located in the cloud, information is exchanged between the first electronic device and the cloud. For example, the first electronic device may send a request message to the cloud to request the processing module in the cloud to perform the corresponding processing operation; after the processing module in the cloud completes the corresponding processing operation, it may return a response message to the first electronic device. Optionally, the request message may carry the structured data of the first voice command (including the intention) and/or the associated information corresponding to the first voice command, and the response message may carry the second voice command or the slot of the second voice command.
In a possible implementation, if the automatic speech recognition module is located in the first electronic device and the natural language understanding module and the processing module are located in the cloud, information is exchanged between the first electronic device and the cloud. For example, the first electronic device may send a request message to the cloud to request the processing module in the cloud to perform the corresponding processing operation; after the processing module in the cloud completes the corresponding processing operation, it may return a response message to the first electronic device. Optionally, the request message may carry the text data of the first voice command and/or the associated information corresponding to the first voice command, and the response message may carry the second voice command or the slot of the second voice command.
In a possible implementation, the above system further includes: an execution module, configured to, according to the intention and slot of the first voice command, execute the first voice command or instruct the execution device of the first voice command to execute the first voice command, where the execution device is provided by the slot of the first voice command.
In a possible implementation, the above system further includes: a natural language generation module and a text-to-speech module;
the execution module is further configured to obtain the execution result of the first voice command;
the natural language generation module is configured to convert the execution result of the first voice command into text data, where the text data is natural language in text format;
the text-to-speech module is configured to convert the text data into audio data.
In a third aspect, an electronic device is provided, including: one or more processors; and one or more memories storing one or more computer programs, the one or more computer programs including instructions which, when executed by the one or more processors, cause the electronic device to perform the method according to any one of the above first aspect.
In a fourth aspect, a computer-readable storage medium is provided, including a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of the above first aspect.
In a fifth aspect, a computer program product is provided which, when run on an electronic device, causes the electronic device to perform the method according to any one of the above first aspect.
In a sixth aspect, a chip system is provided, including: a memory configured to store a computer program; and a processor, where after the processor calls and runs the computer program from the memory, an electronic device installed with the chip system is caused to perform the method according to any one of the above first aspect.
For the beneficial effects of the above second to sixth aspects, refer to the beneficial effects of the first aspect, which are not repeated here.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the system architecture of scenario one in the embodiments of the present application;
Figure 2 is a schematic diagram of the system architecture of scenario two in the embodiments of the present application;
Figure 3 is a schematic diagram of the system architecture of scenario three in the embodiments of the present application;
Figure 4 is a schematic diagram of the system architecture of scenario four in the embodiments of the present application;
Figure 5 is a schematic diagram of the system architecture of scenario five in the embodiments of the present application;
Figure 6 is a schematic diagram of the internal hardware structure of the electronic device provided by the embodiments of the present application;
Figure 7 is a schematic diagram of the software structure of the electronic device in the embodiments of the present application;
Figures 8a, 8b, and 8c are schematic diagrams of the deployment of the functional modules of the voice assistant in the embodiments of the present application;
Figure 9 is a schematic diagram of the execution logic of the voice command processing method provided by the embodiments of the present application;
Figure 10 is a schematic flowchart of a voice command processing method provided by the embodiments of the present application;
Figure 11 is a schematic diagram of clarifying the execution device through multiple rounds of dialogue in the embodiments of the present application;
Figure 12 is a schematic flowchart of a voice command processing method provided by the embodiments of the present application;
Figure 13 is a schematic flowchart of another voice command processing method provided by the embodiments of the present application;
Figures 14a and 14b are schematic diagrams of the voice command processing scenario in Example 1 in the embodiments of the present application;
Figures 15a and 15b are schematic diagrams of the voice command processing scenario in Example 2 in the embodiments of the present application.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. The terms used in the following embodiments are only for the purpose of describing specific embodiments and are not intended to limit the present application. As used in the specification and the appended claims of the present application, the singular expressions "a", "an", "the", "the above", "said", and "this" are intended to also include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the embodiments of the present application, "one or more" means one, two, or more than two; "and/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate the cases where A exists alone, A and B exist simultaneously, and B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship.
Reference in this specification to "one embodiment" or "some embodiments" and the like means that a specific feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
"A plurality of" in the embodiments of the present application means two or more. It should be noted that, in the description of the embodiments of the present application, terms such as "first" and "second" are used only for the purpose of distinguishing the description, and cannot be understood as indicating or implying relative importance, nor as indicating or implying an order.
Current voice assistants lack the ability to continue across multiple rounds when processing voice commands, and require every voice command input by the user to provide complete information, for example including the intention and the slot (for example including the device used to execute the intention, or the application or service used to execute the intention). If the semantics of a voice command input by the user are incomplete, for example the slot is missing, the voice assistant cannot understand the voice command, and the voice command cannot be executed.
On this basis, the embodiments of the present application provide a voice command processing method, apparatus, system, and storage medium, so that when the semantics of a voice command are incomplete, the voice assistant can determine the missing part of the current voice command (such as the execution device information) based on historical voice commands, so that the semantics of the voice command become complete and the voice command can then be executed.
The technical terms involved in the embodiments of the present application are first described below.
(1) Voice assistant.
The "voice assistant" in the embodiments of the present application refers to a voice application that, through the intelligent interaction of intelligent dialogue and instant question answering, helps users solve various problems, for example problems of daily life. A voice assistant usually has an automatic speech recognition (ASR) function, a natural language understanding (NLU) module, and the like. The ASR function is used to convert audio data into text data, and the NLU function is used to parse information such as the intention and slot contained in a voice command from the text data.
In a possible implementation, all functions of the voice assistant can be implemented in one electronic device. In another possible implementation, the functions of the voice assistant can be implemented in the cloud, and the function of receiving the audio data of voice commands can be implemented in the electronic device. In another possible implementation, some functions of the voice assistant are implemented in the electronic device, and the other functions are implemented in the cloud. In another possible implementation, some functions of the voice assistant are implemented in a first electronic device, and the other functions are implemented in a third electronic device; for example, the function of receiving the audio data of voice commands is implemented on the first electronic device, and the other functions of the voice assistant are implemented on the third electronic device.
It should be noted that a voice assistant may also be called a voice application or voice application program, and the embodiments of the present application do not limit the naming of the voice assistant.
(2)语音指令,以及语音指令中的意图和槽位。
意图,是指识别用户实际的或潜在的需求是什么。从根本来说,意图识别是一个分类器,将用户需求划分为某个类型;或者,意图识别是一个排序器,将用户的潜在可能需求集合按照可能性进行排序。
槽位,即意图所带的参数。一个意图可能对应若干个槽位。示例性的,槽位可以包括语音指令的执行设备信息(比如执行设备的类型,具体如“空调”),执行设备所在的位置(比如执行设备所在的房间),执行用户意图的应用程序或服务等。例如对空调温度进行控制时,需要在语音指令中明确给出执行设备为空调以及空调的位置(如果附近有多个空调),以上参数即“对空调温度进行控制”这一意图的槽位。
以槽位参数中的执行设备信息为例,执行设备信息的数据形式可以包括以下两种:
第一种:使用自然语言描述,采用槽位的方式,比如执行设备信息的形式为room(房间):“卧室”,device(设备):“灯”;
第二种:使用底层协议参数描述,直接记录底层设备细节。比如将‘卧室’‘灯’这样的自然语言描述的设备信息转换成设备的协议描述,比如将‘卧室’‘灯’映射为符合物联网(internet of things,IoT)协议的设备标识(deviceID)、设备类型(deviceType)等参数。映射后的参数被携带于控制指令发送给执行设备,以便控制设备可以执行该控制指令。
Together, the intent and slots constitute the "user action". An electronic device cannot understand natural language directly, so the role of intent recognition is to map natural language or operations into a structured semantic representation that a machine can understand.
Taking a "control the air-conditioner temperature" need as an example, the voice assistant's parsing result for a voice command is as follows:
The voice command entered by the user is: "Turn up the temperature of the living-room air conditioner a bit";
The parsed user intent is: turn up the temperature (TURN_UP_TEMPRATURE);
The parsed slots include: slot 1 (location): living room; slot 2 (execution device): air conditioner.
In the embodiments of this application, a voice command takes different data forms at different processing stages. For example, at the receiving stage, the electronic device receives the audio data of the voice command; the audio data of the voice command can be converted into the corresponding text data; the text data of the voice command can be parsed into the structured data of the voice command, including the intent and slots. Further, a slot of the voice command may appear as a natural-language description (such as a device name, e.g. "air conditioner") or be further mapped to underlying protocol parameters (such as a device type and device identifier). The structured data of the voice command can further be converted into a control instruction recognizable by the execution device, so that the execution device performs the corresponding operation.
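These successive data forms can be made concrete with a minimal sketch; the dict layouts below are assumptions for illustration, not the patent's actual data structures:

```python
# One voice command in its successive data forms (illustrative shapes only).
audio_data: bytes = b"..."  # audio data as received by the microphone

text_data: str = "Turn up the temperature of the living-room air conditioner a bit"  # after ASR

structured_data = {  # after NLU parsing: intent plus natural-language slots
    "intent": "TURN_UP_TEMPRATURE",
    "slots": {"location": "living room", "device": "air conditioner"},
}

control_instruction = {  # after slot mapping: what the execution device receives
    "commandType": "TURN_UP_TEMPRATURE",
    "deviceID": "deviceID_air-conditioning_01",
}
```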
In the description below, some parts simply use "voice command" generically; those skilled in the art should be able to understand from the context which data form of the "voice command" is meant.
(3) Electronic device.
In the embodiments of this application, an electronic device may include a mobile phone, a personal computer (PC), a tablet computer, a desktop computer, a handheld computer, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a router, a television, and the like. Alternatively, an electronic device may include devices that can access a home wireless local area network, such as a speaker, a camera, an air conditioner, a refrigerator, a smart curtain, a desk lamp, a ceiling lamp, a rice cooker, security equipment (such as a smart electronic lock), a robot, a sweeping robot and a smart scale. Alternatively, an electronic device may include wearable devices such as smart earphones, smart glasses, a smart watch, a smart band, an augmented reality (AR)/virtual reality (VR) device, a wireless locator, a tracker and an electronic collar. An electronic device in the embodiments of this application may also be a device such as a car audio system or a car air conditioner. The embodiments of this application place no particular restriction on the specific form of the electronic device.
(4) Cloud.
Also called a cloud platform, it is a software platform that uses application virtualization technology and integrates functions such as software search, download, use, management and backup. Through this platform, common software of all kinds can be packaged in an independent virtualized environment, so that application software is decoupled from the system.
The embodiments of this application are described in detail below with reference to the accompanying drawings.
First, several scenarios of the embodiments of this application are described.
Scenario 1
Refer to Figure 1, a schematic diagram of the system architecture of Scenario 1 in an embodiment of this application. As shown, the first electronic device 10 has an audio collection apparatus or module (such as a microphone) and can receive the user's speech; a voice assistant is provided in the first electronic device 10; and the first electronic device 10 also stores the set of voice commands that it has received and processed (hereinafter called the historical voice command set). The historical voice command set includes at least one voice command.
Optionally, a communication connection may also be established between the first electronic device 10 and the second electronic device 20. For example, the connection between the first electronic device 10 and the second electronic device 20 may be a direct point-to-point connection, such as a Bluetooth connection, a connection over a local area network, or a connection over the internet, which is not limited here. The second electronic device 20 may have no voice assistant and no voice command processing capability; it therefore needs to connect to another electronic device (such as the first electronic device 10) so that the other device processes voice commands, and it receives and responds to the control instructions that the other device sends based on the voice commands. For example, the second electronic device 20 may be an air conditioner, a light, a curtain, or the like. Of course, the second electronic device 20 may also have voice processing capability.
After the voice assistant receives a voice command in the form of audio data (also called a voice signal), it recognizes the audio-form voice command based on the ASR function to obtain the text-form voice command, and parses the text-form voice command based on the NLU function. If the voice command contains no slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set stored in the first electronic device 10, obtains the historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of that historical voice command, thereby completing the current voice command. The voice assistant can then execute the voice command.
Optionally, the ways in which the voice assistant executes a voice command may include the following cases:
Case 1:
The voice assistant determines that the user intent of the voice command is to perform some operation and that the executor in the slot is an application or service; the voice assistant then realizes the user's intent by calling the corresponding application or service on the first electronic device 10 or in the cloud.
For example, the voice command uttered by the user is "pay". Based on the historical voice command set, the voice assistant determines that the slot is the "Alipay" application and completes the voice command as "pay with Alipay"; the voice assistant then calls the "Alipay" application installed on its electronic device to complete the payment operation.
For another example, the voice command uttered by the user is "back up files". Based on the historical voice command set, the voice assistant determines that the slot is the "Huawei Cloud" service and completes the voice command as "back up the files to Huawei Cloud"; the voice assistant then calls the corresponding service interface and, by interacting with the cloud, has the service in the cloud complete the operation.
Case 2:
The voice assistant determines that the user intent of the voice command is to perform some operation and that the slot is the first electronic device 10 on which the voice assistant resides; the voice command is then executed by calling the corresponding function interface in the first electronic device 10, so that the first electronic device 10 executes it.
For example, the voice command uttered by the user is "play movie XXX". Based on the historical voice command set, the voice assistant determines that the slot is "TV" and completes the voice command as "play movie XXX on the TV"; the voice assistant then calls the relevant function interface of the smart TV so that the smart TV responds to the voice command and plays the movie.
Case 3:
The voice assistant determines that the intent of the voice command is to perform some operation and that the execution device in the slot is the second electronic device 20; the voice assistant then generates, from the voice command, a control instruction that the second electronic device 20 can recognize and sends the control instruction to the second electronic device 20, so that the second electronic device 20 executes it.
For example, the voice command received by the first electronic device 10 (a smart speaker) is "play movie XXX". Based on the historical voice command set, the voice assistant determines that the slot is "TV" and completes the voice command as "play movie XXX on the TV"; it converts the voice command into a control instruction that the second electronic device 20 (the TV) can recognize and sends the control instruction to the second electronic device 20 (the TV), so that the TV responds to the control instruction and plays the movie.
Scenario 2
Refer to Figure 2, a schematic diagram of the system architecture of Scenario 2 in an embodiment of this application. As shown, the first electronic device 10 has an audio collection apparatus or module (such as a microphone), can receive the user's speech, and is provided with a voice assistant. A historical voice command set is kept in the cloud, and the first electronic device 10 can store the voice commands it has received and processed in the cloud. Optionally, other electronic devices, such as the third electronic device 30 in the figure, may also store the voice commands they have received and processed into the historical voice command set in the cloud.
Optionally, a communication connection may also be established between the first electronic device 10 and the second electronic device 20. For example, the connection between the first electronic device 10 and the second electronic device 20 may be a direct point-to-point connection, such as a Bluetooth connection, a connection over a local area network, or a connection over the internet, which is not limited here.
Optionally, the first electronic device 10 and the third electronic device 30 belong to the same group; electronic devices within the same group share one historical voice command set in the cloud, and electronic devices in different groups cannot share the same historical voice command set. The grouping may be set by the user; for example, the electronic devices at one family residence may be assigned to the same group, or the electronic devices at several residences of the same family may be assigned to the same group, which is not limited in the embodiments of this application.
In one possible implementation, after the voice assistant receives a voice command in the form of audio data, it recognizes the audio-form voice command based on the ASR function to obtain the text-form voice command, and parses the text-form voice command based on the NLU function. If the voice command contains no slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set in the cloud, obtains the historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of that historical voice command, thereby completing the current voice command. The voice assistant can then execute the voice command. For the ways in which the voice assistant executes the voice command, refer to the description in Scenario 1.
Scenario 3
Refer to Figure 3, a schematic diagram of the system architecture of Scenario 3 in an embodiment of this application. As shown, the first electronic device 10 and the third electronic device 30 each have an audio collection apparatus or module (such as a microphone), can receive the user's speech, and are each provided with a voice assistant.
A communication connection is established between the first electronic device 10 and the third electronic device 30. For example, the connection between the first electronic device 10 and the third electronic device 30 may be a direct point-to-point connection, such as a Bluetooth connection, a connection over a local area network, or a connection over the internet, which is not limited here.
Optionally, the first electronic device 10 and the third electronic device 30 belong to the same group; electronic devices within the same group can share a historical voice command set, and electronic devices in different groups cannot share a historical voice command set. The grouping may be set by the user; for example, the electronic devices at one family residence may be assigned to the same group, or the electronic devices at several residences of the same family may be assigned to the same group, which is not limited in the embodiments of this application.
The first electronic device 10 and the third electronic device 30 can share historical voice commands, so that the historical voice command sets in the first electronic device 10 and the third electronic device 30 stay synchronized. For example, after the first electronic device 10 receives and finishes processing a voice command, it may store the processed voice command in its own historical voice command set and send the voice command to the third electronic device 30, so that the third electronic device 30 stores the voice command in its historical voice command set, thereby synchronizing the historical voice command sets of the two devices. For another example, the first electronic device 10 and the third electronic device 30 may synchronize their historical voice command sets at a set time or period.
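A minimal sketch of the push-style variant follows; send_to_peer is a stand-in for whatever transport (Bluetooth, local area network or internet) connects the group, and the dict/list shapes are assumptions:

```python
# Push-style synchronization: after processing a command, store it locally
# and replicate it to every other member device in the same group.
def on_command_processed(command: dict, local_history: list, peers: list) -> None:
    local_history.append(command)      # store into this device's history set first
    for peer in peers:
        send_to_peer(peer, command)    # then replicate to each group member

def send_to_peer(peer: str, command: dict) -> None:
    ...  # transport-specific delivery, omitted in this sketch
```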
Optionally, a communication connection may also be established between the first electronic device 10 and the second electronic device 20. For example, the connection between the first electronic device 10 and the second electronic device 20 may be a direct point-to-point connection, such as a Bluetooth connection, a connection over a local area network, or a connection over the internet, which is not limited here.
After the voice assistant receives a voice command in the form of audio data, it recognizes the audio-form voice command based on the ASR function to obtain the text-form voice command, and parses the text-form voice command based on the NLU function. If the voice command contains no slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant queries the historical voice command set on its own electronic device (a set shared among multiple electronic devices), obtains the historical voice command related to the current voice command, and determines the slot of the current voice command according to the slot of that historical voice command, thereby completing the current voice command. The voice assistant can then execute the voice command. For the ways in which the voice assistant executes the voice command, refer to the description in Scenario 1.
Scenario 4
Some functions of the voice assistant may be located on the device side (on the electronic device) and the others in the cloud. Figure 4 shows an example system architecture of Scenario 4 in an embodiment of this application. As shown, the ASR and NLU functions of the voice assistant are located in the first electronic device 10, while the function of filling in the slots of a voice command whose slots are missing (called the processing function in the embodiments of this application) is located in the cloud, as is the historical voice command set.
The first electronic device 10 has an audio collection apparatus or module (such as a microphone) and can receive the user's speech. Optionally, a communication connection may also be established between the first electronic device 10 and the second electronic device 20. The second electronic device 20 may have no voice assistant and no voice command processing capability; it therefore needs to connect to another electronic device (such as the first electronic device 10) so that the other device processes voice commands, and it receives and responds to the control instructions that the other device sends based on the voice commands. Of course, the second electronic device 20 may also have voice processing capability.
After the first electronic device 10 receives a voice command in the form of audio data, it recognizes the audio-form voice command based on the ASR function to obtain the text-form voice command, and parses the text-form voice command based on the NLU function. If the voice command contains no slot, or only uses a demonstrative pronoun to indicate the slot, the voice assistant sends a request message to the cloud; according to the request message, the cloud obtains the historical voice command related to the current voice command based on the historical voice command set stored in the cloud and sends the relevant data of that historical voice command to the first electronic device 10, so that the voice assistant in the first electronic device 10 can use the slot of the historical voice command to complete the current voice command. The voice assistant can then execute the voice command. For the ways in which the voice assistant executes the voice command, refer to the description in Scenario 1.
Scenario 5
Some functions of the voice assistant may be located on the device side (on the electronic device) and the others in the cloud. Figure 5 shows an example system architecture of Scenario 5 in an embodiment of this application. As shown, the ASR and NLU functions of the voice assistant are located in the cloud; the function of filling in the slots of a voice command whose slots are missing (called the processing function in the embodiments of this application) is also located in the cloud; and the cloud also stores the historical voice command set.
The first electronic device 10 has an audio collection apparatus or module (such as a microphone) and can receive the user's speech. Optionally, a communication connection may also be established between the first electronic device 10 and the second electronic device 20.
After the first electronic device 10 receives a voice command in the form of audio data, it sends the audio-form voice command to the cloud; the cloud recognizes the audio-form voice command based on the ASR function to obtain the text-form voice command, and parses the text-form voice command based on the NLU function. If the voice command contains no slot, or only uses a demonstrative pronoun to indicate the slot, the cloud obtains the historical voice command related to the current voice command based on the historical voice command set stored in the cloud and sends the relevant data of that historical voice command to the first electronic device 10, so that the voice assistant in the first electronic device 10 can use the slot of the historical voice command to complete the current voice command. The voice assistant can then execute the voice command. For the ways in which the voice assistant executes the voice command, refer to the description in Scenario 1.
In other examples, the ASR function may be located on the electronic device side.
It should be noted that the system architectures of the above scenarios are only examples, and the embodiments of this application place no restriction on them; for example, more devices and more types of electronic devices may be included.
Refer to Figure 6, a schematic diagram of the internal hardware structure of the electronic device 100 provided by an embodiment of this application. The electronic device 100 may be an electronic device in the scenarios shown in Figures 1, 2 and 3 above.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, and a microphone 170C. Further, it may also include an earphone interface 170D, a sensor module 180, buttons 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, a display process unit (DPU), and/or a neural-network processing unit (NPU), etc. The different processing units may be independent components or may be integrated into one or more processors. In some embodiments, the electronic device 100 may also include one or more processors 110. The processor is the nerve center and command center of the electronic device 100. The processor can generate operation control signals according to instruction operation codes and timing signals to control instruction fetching and instruction execution. A memory may also be provided in the processor 110 for storing instructions and data.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, etc. The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The USB interface 130 may be used to connect a charger to charge the electronic device 100, to transfer data between the electronic device 100 and peripheral devices, or to connect earphones and play audio through them.
The sensor module 180 may include one or more of the following: a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, and a bone conduction sensor 180M.
The charging management module 140 is configured to receive charging input from a charger. The power management module 141 is configured to connect the battery 142, the charging management module 140 and the processor 110. The wireless communication function of the electronic device 100 may be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and the like.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple ones. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of the wireless local area network. In some other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 can provide solutions for wireless communication applied on the electronic device 100, including 2G/3G/4G/5G. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like. The mobile communication module 150 may receive electromagnetic waves through the antenna 1, perform processing such as filtering and amplification on the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 150 may also amplify signals modulated by the modem processor and convert them into electromagnetic waves radiated out through the antenna 1. In some embodiments, at least some of the functional modules of the mobile communication module 150 may be provided in the processor 110.
The modem processor may include a modulator and a demodulator. The modulator is used to modulate a low-frequency baseband signal to be transmitted into a medium- or high-frequency signal. The demodulator is used to demodulate a received electromagnetic wave signal into a low-frequency baseband signal. The demodulator then transmits the demodulated low-frequency baseband signal to the baseband processor for processing. After being processed by the baseband processor, the low-frequency baseband signal is passed to the application processor. The application processor outputs a sound signal through an audio device (not limited to the speaker 170A, the receiver 170B, etc.), or displays an image or video through the display screen 194. In some embodiments, the modem processor may be an independent component. In other embodiments, the modem processor may be independent of the processor 110 and provided in the same component as the mobile communication module 150 or other functional modules.
The wireless communication module 160 can provide solutions for wireless communication applied on the electronic device 100, including wireless local area networks (WLAN) (such as a Wi-Fi network), Bluetooth (BT), the global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The wireless communication module 160 may be one or more components integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive signals to be transmitted from the processor 110, frequency-modulate and amplify them, and convert them into electromagnetic waves radiated out through the antenna 2.
In some embodiments, the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150 and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with networks and other devices through wireless communication technologies. The electronic device 100 may implement the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, for example saving files such as music and videos in the external memory card.
The electronic device 100 may implement the display function through the GPU, the display screen 194, the application processor, and the like. The display screen 194 is used to display images, videos, and the like.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
It can be understood that the structure illustrated in the embodiments of this application does not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown, or combine some components, or split some components, or arrange the components differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Refer to Figure 7, a schematic diagram of the software structure of an electronic device in an embodiment of this application. The electronic device may be an electronic device in the scenarios shown in Figures 1, 2, 3, 4 or 5 above.
Taking the electronic device 100 as an example, its software architecture may include an application layer 701, a framework layer 702, native libraries & Android Runtime 703, a hardware abstraction layer (HAL) 704, and a kernel 705. The embodiments of this application are described taking the case where the operating system of the electronic device 100 is Android as an example; the electronic device 100 may also run HarmonyOS, iOS, or another operating system, which is not limited in the embodiments of this application.
The application layer 701 may include a voice assistant 7011. Optionally, the application layer 701 may also include other applications; for example, the other applications may include Camera, Gallery, Calendar, Phone, Maps, Navigation, Music, Video, Messages, and the like, which is not limited in the embodiments of this application.
The voice assistant 7011 can control the electronic device 100 to communicate with the user through speech. The voice assistant 7011 can access the historical voice command set so as to obtain, from it, the historical voice command related to the current voice command, and thereby determine the missing slot of the current voice command according to the slot of that historical voice command. Optionally, the historical voice command set may be local to the electronic device 100 or in the cloud. If it is in the cloud, the voice assistant 7011 may access the cloud historical voice command set through the communication module of the electronic device 100, or may request the cloud to look up the historical voice command related to the current voice command.
The framework layer 702 may include an activity manager, a window manager, a content provider, a resource manager, a notification manager, and the like, without any limitation in the embodiments of this application.
Native libraries & Android Runtime 703: includes the core libraries and the virtual machine. The Android Runtime is responsible for scheduling and management of the Android system. The core libraries consist of two parts: the functions that the Java language needs to call, and the core libraries of Android. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files, and performs functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The HAL 704 may include a microphone, a speaker, a Wi-Fi module, a Bluetooth module, a camera, sensors, and the like.
The kernel 705 is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
It should be understood that the modules included in the layers shown in Figure 7 are the modules involved in the embodiments of this application, and they do not constitute a limitation on the structure of the electronic device or on the layers in which the modules are deployed (which are illustrative). In one embodiment, the modules shown in Figure 7 may be deployed separately, or several modules may be deployed together; the division of modules in Figure 7 is an example. In one embodiment, the names of the modules shown in Figure 7 are illustrative.
All functions of the voice assistant in the embodiments of this application may be deployed in the electronic device, or all of them may be deployed in the cloud, or some may be deployed in the electronic device and some in the cloud. Figures 8a, 8b and 8c show several possible deployments.
Figure 8a illustrates the case where all functions of the voice assistant are deployed in the electronic device. As shown, the electronic device has a voice receiving function and a voice output function. A voice assistant is installed in the electronic device and may include a management module 801, an ASR module 802, an NLU module 803, a processing module 804, and an execution module 805; further, it may also include a natural language generation (NLG) module 806 and a text-to-speech (TTS) module 807.
The management module 801 is configured to collect voice signals through the microphone (also called voice data, the audio data of a voice command, audio signals, etc., or simply speech), and to play the audio data to be output through the speaker.
The ASR module 802 is configured to convert the audio data of a voice command into text data, for example converting the speech "Xiaoyi, please play music" entered by the user through the microphone into text data that applications can recognize (which may be called text information, or simply text).
The NLU module 803 is configured to parse the text data of a voice command to obtain the user intent and slots, for example recognizing from the text data "Xiaoyi, please play music" that the user intent is "play music".
The processing module 804 is configured to, when a slot is missing from the data structure of the voice command parsed by the NLU module 803, obtain the historical voice command related to the current voice command based on the historical voice command set, and fill in the slot of the current voice command according to the slot of that historical voice command. Further, it may provide the completed voice command to the execution module 805, or send it to the NLU module 803, which then sends it to the execution module 805. In another possible implementation, the processing module 804 may also send the slot to be added to the first voice command to the NLU module 803, which completes the first voice command and sends it to the execution module 805.
The execution module 805 is configured to execute the corresponding task on this device based on the parsed user intent and slots, i.e. to respond according to the user intent. For example, when the user intent is "play music" and the execution device in the slot is "smartphone", the execution module 805 can call the music player of this device (the smartphone) to play music. In other situations, if the parsed slot indicates that the executor of the user intent is a cloud service, the execution module 805 can call the corresponding cloud service to execute the user intent. In still other situations, if the parsed slot indicates that the executor of the user intent is not this device but another electronic device, the execution module 805 can generate, according to the user intent and slot, a control instruction that the other electronic device can recognize, and send the control instruction to the other electronic device through the communication module of the electronic device 100. The execution module 805 can also send the execution result of the voice command (in structured data form) to the NLG module 806.
The NLG module 806 is configured to generate a natural-language execution result from the execution result of the voice command (structured data), for example generating the natural-language reply "OK, the music is playing" from the structured data of the reply information.
The TTS module 807 is configured to convert the text data of the execution result from the NLG module 806 into audio data, and may send the audio data of the execution result to the management module 801, which can send the audio data of the execution result to the speaker for playback.
Optionally, after the current voice command is completed, the processing module 804 stores the current voice command into the historical voice command set. The processing module 804 may also share or synchronize the historical voice command set in this device with the historical voice command sets in other electronic devices.
Refer to Figure 8b, a schematic diagram of another deployment of the voice assistant provided by an embodiment of this application, in which the processing module 804 of the voice assistant is deployed in the cloud, the historical voice command set is also deployed in the cloud, and the other functional modules are deployed in the electronic device.
Refer to Figure 8c, a schematic diagram of another deployment of the voice assistant provided by an embodiment of this application, in which the management module 801, the ASR module 802, the NLU module 803 and the processing module 804 of the voice assistant are deployed in the cloud, as is the historical voice command set. The NLG module 806 and the TTS module 807 may also be deployed in the cloud. The electronic device has a voice receiving function and a voice output function: it can receive the audio data of a voice command and output the speech of the voice command's execution result.
It should be noted that Figures 8a, 8b and 8c above only illustrate several deployments of the functional modules of the voice assistant; in real application scenarios there may be other deployments or other ways of dividing the functional modules, which is not limited in the embodiments of this application.
Taking a specific application scenario as an example, Figure 9 shows the execution logic of the voice command processing method provided by an embodiment of this application. As shown, when the user uses the voice assistant, the flow is as follows: the user first utters the speech "make it brighter" to the electronic device provided with the voice assistant; the voice assistant converts the speech into text; the NLU module of the voice assistant then performs natural language processing on the text and extracts the user intent: brighten a device. Because the slot (the execution device) is missing, the voice assistant starts the following dialogue state tracking process:
The voice assistant obtains the historical voice command set, matches the current turn (the current voice command) against the historical voice commands in the set, obtains the historical turn related to the current turn, and adds the execution-device information of that related historical turn to the slot of the current turn. The completed current turn is:
Intent: brighten a device;
Slot: light.
The voice assistant converts the completed voice command of the current turn into a control instruction that the execution device "light" can recognize and execute, and sends the control instruction to the execution device "light", so that the execution device executes it.
Optionally, after completing the voice command of the current turn, the voice assistant also starts the following dialogue policy process: storing the completed voice command into the historical voice command set. The historical voice command set may be local to the electronic device or in the cloud.
Optionally, after receiving the information fed back by the execution device "light" indicating successful execution, the voice assistant generates the data structure of the reply speech:
Action: brighten a device;
Execution device: light;
Reply: success.
The NLG module in the voice assistant generates the natural language "OK, the light has been brightened" from the data structure of the reply speech, converts the natural language into audio data through the TTS module, and plays the audio data.
Based on the software architecture of the voice assistant and the voice command processing logic above, the voice command processing method provided by the embodiments of this application is described below.
Refer to Figure 10, a schematic flowchart of a voice command processing method provided by an embodiment of this application. In this flow, for a current voice command missing a slot (such as the execution device), the voice assistant can obtain the historical voice command related to the voice command based on the historical voice command set, and add the slot of that historical voice command to the current voice command, making the current voice command complete.
In the flow shown in Figure 10, the functions of the voice assistant are deployed in the electronic device, i.e. the flow can be applied to the architecture shown in Figure 8a. The functions of the voice assistant may also be deployed in the cloud, i.e. the flow can also be applied to the architecture shown in Figure 8c. As shown in Figure 10, the flow may include the following steps:
Step 1: The audio module in the electronic device receives the user's speech, obtains the audio data of the voice command, and sends the audio data of the voice command to the management module in the voice assistant.
Optionally, before using the voice assistant, the user first logs in to the voice assistant with a user account. After login, the audio module of the electronic device (such as the microphone) is started; when the user speaks, the microphone receives the user's speech and obtains the audio data of the voice command.
Step 2a: After receiving the audio data of the voice command, the management module in the voice assistant sends the audio data of the voice command to the ASR module in the voice assistant.
Step 2b: Optionally, the management module may also cache information such as device information, location information, user account information, time information and user identity information, for example caching it in a designated area of memory or notifying the relevant functional module (such as the processing module) of the cache address after caching, so that the processing module can obtain this information and use it as the associated information corresponding to the current voice command, i.e. as a basis for obtaining the historical voice command related to the current voice command.
The device information is information about the device that received the voice command; for example, the device information may include the device type, device identifier (deviceID), device status and device name. The location information is the location information of the device receiving the current voice command. The user account information is the account information of the user logged in to the voice assistant; for example, the user account information may include the user identifier. The time information is the current time. The user identity information may be feature information of the audio data (a voiceprint), obtained by feature extraction from the audio data of the received voice command. The user identity information may also be information capable of indicating the user's identity, such as a user identifier (for example, the user's role as a member of the family): after extracting the feature information of the voice command's audio data, the corresponding user identifier can be obtained by looking up a user list (which records the correspondence between voiceprints and user identifiers) with that feature information. One or more of the device information, user account information, time information and user identity information can serve as the associated information corresponding to the current voice command, for use in the subsequent voice command processing.
Step 3: The ASR module in the voice assistant recognizes the audio data of the voice command to obtain the text data of the voice command.
The ASR module may use an ASR algorithm to recognize the audio data of the voice command; the embodiments of this application do not limit the ASR algorithm.
Step 4: The ASR module in the voice assistant sends the text data of the voice command to the NLU module.
Step 5: The NLU module in the voice assistant parses the text data of the voice command to obtain the user intent.
When the NLU module parses the text data of the voice command: if both the user intent and the slot are obtained, the voice command is semantically complete, so the subsequent step 10 is performed, sending the structured data of the voice command (including the user intent and slot) to the execution module; if the user intent is obtained but no slot is, or the slot is referred to by a demonstrative pronoun, the voice command is semantically incomplete, and in this case the subsequent steps 6 to 9 are performed.
The NLU module may use an NLU algorithm to parse the text data of the voice command; the embodiments of this application do not limit the NLU algorithm.
Step 6: The NLU module in the voice assistant sends first indication information to the processing module, instructing the processing module to obtain the slot missing from the voice command.
Optionally, the first indication information may include the user intent parsed by the NLU module.
Step 7: According to the first indication information, the processing module in the voice assistant obtains the related historical voice command by querying the historical voice command set, and further obtains the slot of that historical voice command.
Optionally, the processing module may obtain the previously cached associated information corresponding to the current voice command and use the associated information as one of the bases for obtaining the matching historical voice command.
Optionally, after obtaining the response information indicating that the voice command was executed successfully, the processing module may store the voice command, together with the associated information corresponding to the voice command, into the historical voice command set, so that semantically incomplete voice commands can be processed based on this set in subsequent dialogue turns.
Optionally, the historical voice commands in the historical voice command set may be structured data; correspondingly, when obtaining the second voice command related to the first voice command based on the historical voice command set, the structured data of the first voice command can be matched against the structured data of the historical voice commands. The structured data of a historical voice command includes the user intent and slot, the slot being described in natural language, such as "air conditioner". The structured data may be obtained by parsing by the NLU module in the voice assistant. If part of the structured data parsed by the NLU module is missing (for example the slot), the missing part (for example the slot) can be filled in, and the completed structured data stored into the historical voice command set. An example of the structured data of a historical voice command is:
User intent: turn up the temperature (TURN_UP_TEMPRATURE);
Slots: slot 1 (location): living room; slot 2 (execution device): air conditioner.
Optionally, when the historical voice commands in the historical voice command set are stored in structured data form, the slot in the structured data may also be an underlying protocol parameter obtained through lower-level mapping; for example, "living room" and "air conditioner" above may be mapped to the device identifier (deviceID) uniquely identifying that electronic device (the air conditioner). With this structured data form, the slot obtained based on the historical voice command set is already an underlying protocol parameter and requires no further parsing or feature extraction: the underlying protocol parameter can be added directly to the semantically incomplete (slot-missing) voice command. An example of the structured data of such a historical voice command is:
User intent: turn up the temperature (TURN_UP_TEMPRATURE);
Slot: deviceID_air-conditioning_01, where "deviceID_air-conditioning_01" denotes the device identifier of the air conditioner located in the living room.
Optionally, the historical voice commands in the historical voice command set may be text data; correspondingly, when obtaining the second voice command related to the first voice command based on the historical voice command set, the text data of the first voice command can be matched against the text data of the historical voice commands. The text data of a historical voice command may be recognized by the ASR module in the voice assistant from the audio data of the voice command, or may be obtained by completing the semantics of a voice command that was semantically incomplete as recognized by the ASR module. An example of the text data of a historical voice command is "Turn up the temperature of the living-room air conditioner a bit".
Optionally, the historical voice command set may also include the associated information corresponding to each historical voice command. Optionally, the associated information corresponding to one historical voice command may include information of one or more dimensions. For example, the associated information corresponding to one historical voice command may include at least one of the following: time information, device information of the receiving device, location information of the receiving device, user account information, and user identity information. The associated information corresponding to a historical voice command is collected when that historical voice command is received.
The time information in the associated information corresponding to a historical voice command may be the time at which the historical voice command was received or the time at which its processing was completed (such as the time after ASR processing of the voice command, or the time of NLU processing of the voice command); the device information in the associated information is the device information of the device that received the historical voice command. For example, the device information may include the device type, device identifier (deviceID), device status and device name. The user account information in the associated information indicates the logged-in user of the voice assistant when the historical voice command was received and processed; the user identity information in the associated information indicates the user who uttered the historical voice command. For example, if historical voice command x corresponds to user identifier y, this indicates that historical voice command x was received and processed after the user logged in to the voice assistant with user identifier y. For the explanation of the associated information corresponding to a voice command, refer to the description in step 2b.
Based on the above embodiments, an example of the historical voice command set may be as shown in Table 1.
Table 1: Historical voice command set.
In Table 1, the associated information corresponding to one historical voice command includes: user account information, user identity information, receiving device, and time.
Optionally, the associated information corresponding to a voice command may also include information such as the time interval between this voice command and the previously received voice command, the dialogue location, and semantic relevance; further, it may also include the user's physiological characteristics, the list of devices surrounding the device receiving the current voice command, the number of dialogue turns, and so on, which are not enumerated one by one here. After receiving a voice command, the voice assistant can collect the above information to obtain the associated information corresponding to the voice command, and the associated information can be stored into the historical voice command set together with the voice command.
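As a minimal sketch, one history entry together with its associated information could be laid out as below; the field names are assumptions drawn from the dimensions listed above, and the pruning helper anticipates the storage-saving note further down:

```python
from dataclasses import dataclass, field
import time

# Illustrative layout of one entry in the historical voice command set.
@dataclass
class HistoryEntry:
    intent: str
    slots: dict
    user_account: str = ""
    user_identity: str = ""
    receiving_device: str = ""
    timestamp: float = field(default_factory=time.time)

def prune_history(history: list, max_age_seconds: float) -> list:
    """Drop entries whose age relative to the current time exceeds a set threshold."""
    now = time.time()
    return [entry for entry in history if now - entry.timestamp <= max_age_seconds]
```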
In one possible implementation, during a voice dialogue, after the voice assistant parses a voice command and obtains the type of the execution device, if it finds that there are multiple devices of that type, the voice assistant prompts the user by voice to clarify or choose among the multiple devices. After obtaining the execution device chosen by the user, the voice assistant, on the one hand, generates a control instruction from the voice command and sends the control instruction to the execution device for execution; on the other hand, it updates the structured data of the voice command according to the execution device chosen by the user, for example updating the execution-device information of the voice command, and stores the updated structured data of the voice command into the historical voice command set.
For example, Figure 11 shows clarifying the execution device through multi-turn dialogue. As shown, after the voice assistant receives the user speech "turn up the air-conditioner temperature", it recognizes the speech as text and parses the text to obtain structured data 1 of the voice command:
User intent: TURN_UP_TEMPRATURE;
Slot: "air conditioner".
The voice assistant determines that the execution device type is air conditioner, queries the registered devices of that type, and finds multiple candidate execution devices such as "the bedroom air conditioner" and "the living-room air conditioner"; it therefore outputs the prompt speech to the user: "You have several air conditioners. Which one should be adjusted: 1. the bedroom air conditioner, 2. the living-room air conditioner, 3. ...";
After the voice assistant receives the user's speech "the living-room one" for choosing the execution device, it converts the speech into text, parses the text, and updates structured data 1 of the voice command according to the parsed execution device, obtaining updated structured data 2 or structured data 3 of the voice command:
Structured data 2:
User intent: TURN_UP_TEMPRATURE;
Slots: "air conditioner", "living room".
Structured data 3:
User intent: TURN_UP_TEMPRATURE;
Slot: deviceID_air-conditioning_01.
In structured data 3, "deviceID_air-conditioning_01" denotes the device identifier of the air conditioner located in the living room, obtained by mapping the natural-language descriptions "air conditioner" and "living room".
Structured data 2 or structured data 3 of this voice command will be stored into the historical voice command set.
In one possible implementation, during a voice dialogue, the voice assistant parses a voice command; if the execution device is missing from the voice command, the voice assistant may choose a default execution device as the execution device of the voice command. If the voice assistant determines from the execution result that the voice command failed, and determines that the failure cause is a wrong execution device, the voice assistant may initiate a dialogue with the user to ask the user about the execution device. After obtaining the execution device, updating the structured data of the voice command, sending the corresponding control instruction to that execution device, and receiving a response indicating successful execution, the voice assistant may store the updated structured data of the voice command into the historical voice command set.
In one possible implementation, to save storage space, historical voice commands in the historical voice command set whose age relative to the current time exceeds a set threshold may be cleared.
In one possible implementation, the historical voice command set is stored locally on the electronic device, and the voice commands in the set include only the voice commands received and processed by this device.
In another possible implementation, the historical voice command set is stored locally on the electronic device, and the voice commands in the set include the voice commands received and processed by this device and may also include the voice commands received and processed by other electronic devices. Communication connections can be established between multiple electronic devices so that historical voice commands are synchronized between them. These electronic devices may synchronize historical voice commands at a set period or a set time; alternatively, after one of them receives and processes a voice command, that device may synchronize the voice command, as a historical voice command, to the other electronic devices.
Optionally, the electronic devices that synchronize or share historical voice commands with one another may be pre-configured as a group, and the group information may be configured on each of these devices; the group information may include information such as a group identifier and the identifiers or addresses of the member devices in the group. The member devices within a group may establish connections with one another based on the group information, or may establish communication connections based on user operations, for example establishing a Bluetooth connection in response to the user's Bluetooth pairing operation.
Optionally, all or some of the electronic devices within an area (such as within one family residence) may be assigned to one group. The electronic devices within a group may include electronic devices of different types, such as smart speakers, smart TVs and smartphones.
Optionally, the member devices within a group may be the electronic devices associated with the same user account information. For example, where each electronic device is associated with only one user account (i.e. only one user account can be used to log in to the voice assistant on the electronic device), the electronic devices associated with the same user account may be assigned to one group.
Optionally, one group or multiple groups may be configured. The member devices within the same group share a historical voice command set.
In another possible implementation, the historical voice command set is stored on the network side, for example in the cloud. The storage address of the cloud historical voice command set can be configured in the electronic device, so that the electronic device can store historical voice commands into the historical voice command set in the cloud.
Depending on where the historical voice command set is stored, the electronic device may query the historical voice command set in the following ways:
Mode 1: when the historical voice command set is stored locally on the electronic device, the locally stored historical voice command set can be queried directly;
Mode 2: when the historical voice command set is stored in the cloud, the electronic device can interact with the cloud to query the historical voice command set, for example sending a query request which may carry query conditions or keywords, such as the user intent, a time range and a user identifier, to obtain the historical voice records meeting the query conditions.
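As an illustration of Mode 2 above, a query request might look like the following; the payload shape and field names are assumptions, not a protocol defined by the patent:

```python
# Sketch of a cloud query request carrying query conditions such as the
# intent, a time range and a user identifier.
query_request = {
    "intent": "TURN_UP_TEMPRATURE",
    "time_range": {"from": "2023-06-29T10:00:00", "to": "2023-06-29T10:05:00"},
    "user_id": "user_001",
}
# The cloud would answer with the matching history records, for example:
# {"records": [{"intent": "TURN_UP_TEMPRATURE", "slots": {"device": "air conditioner"}}]}
```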
In one possible implementation, the process by which the processing module obtains the historical voice command related to the current voice command from the historical voice command set may include: determining the relevance between the historical voice commands in the historical voice command set and the current voice command (i.e. the currently received slot-missing voice command, the "first voice command" in the embodiments of this application), and, according to the relevance between the current voice command and each historical voice command, selecting from the historical voice command set the historical voice command related to the current voice command, for example the historical voice command with the highest relevance.
Optionally, when computing the relevance, the processing module may compute it based on the following information:
(1) computing the relevance between the first voice command and a historical voice command from the text data of the first voice command and the text data of the historical voice command;
(2) computing the relevance between the first voice command and a historical voice command from the structured data of the first voice command and the structured data of the historical voice command;
(3) computing the relevance between the first voice command and a historical voice command from the intent of the first voice command and the intent of the historical voice command;
(4) computing the relevance between the first voice command and a historical voice command from the associated information corresponding to the first voice command and the associated information corresponding to the historical voice command;
(5) computing the relevance between the first voice command and a historical voice command from the first voice command (e.g. its text data or structured data) and its corresponding associated information, and the historical voice command (e.g. its text data or structured data) and its corresponding associated information;
(6) computing the relevance between the first voice command and a historical voice command from the intent of the first voice command and the associated information corresponding to the first voice command, and the intent of the historical voice command and the associated information corresponding to that historical voice command;
(7) computing the relevance between the first voice command and a historical voice command from the first voice command (e.g. its text data), the intent of the first voice command and its corresponding associated information, and the historical voice command (e.g. its text data), the intent of the historical voice command and its corresponding associated information.
The above only lists some possible bases for computing the relevance as examples; the embodiments of this application place no restriction on this.
Optionally, a relevance score may be used to characterize the relevance.
In one possible implementation, computing the relevance score may include: first, after the electronic device receives the current voice command, obtaining the associated information corresponding to the voice command; then, using pre-set rules, matching the associated information corresponding to the current voice command against the associated information corresponding to a historical voice command, to obtain the relevance score between the current voice command and the historical voice command.
Optionally, the associated information may include information of one or more dimensions. For example, the associated information may include one or more of the following: information about the electronic device (such as the device name, device type and device status), user account information (i.e. the account information of the user logged in to the voice assistant), user identity information (associated with feature information of the audio data of the voice command), the reception time of the voice command, the time interval from the previously received voice command, the dialogue location, semantic relevance, the user's physiological characteristics, the list of devices surrounding the device receiving the current voice command, the number of dialogue turns, and so on. Correspondingly, when computing the relevance score, rule-based methods can be used to compute the degree of match between the current voice command and each historical voice command in these dimensions, and the relevance score is then computed comprehensively from the per-dimension matches.
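A minimal sketch of such a rule-based scorer over three of the dimensions named above follows; the dimensions chosen and the weights are illustrative assumptions, not values from the patent:

```python
# Rule-based relevance: weighted per-dimension matching between the
# current command and one historical command.
WEIGHTS = {"intent": 0.4, "user_identity": 0.3, "time": 0.3}

def relevance_score(current: dict, historical: dict,
                    max_gap_seconds: float = 300.0) -> float:
    score = 0.0
    if current.get("intent") == historical.get("intent"):
        score += WEIGHTS["intent"]
    if current.get("user_identity") == historical.get("user_identity"):
        score += WEIGHTS["user_identity"]
    gap = abs(current.get("timestamp", 0.0) - historical.get("timestamp", 0.0))
    score += WEIGHTS["time"] * max(0.0, 1.0 - gap / max_gap_seconds)  # nearer in time scores higher
    return score

def best_match(current: dict, history: list) -> dict:
    """Pick the historical command with the highest relevance score."""
    return max(history, key=lambda h: relevance_score(current, h))
```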
In another possible implementation, computing the relevance score may include: first, encoding the currently received voice command and the associated information corresponding to the voice command, for example encoding the historical dialogue information from the dialogue scenarios on each device (including the input text, device status, etc.) uniformly with one natural-language encoder, thereby encoding the associated information of the historical turns and the current turn; then, feeding the encoding results into a deep-learning neural network for inference, to obtain the relevance scores between the current voice command and the historical voice commands.
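A toy sketch of this learned variant is given below. Here `encode` is a stand-in for a pretrained natural-language encoder, and cosine similarity stands in for the trained network's inference step; neither is the patent's actual model:

```python
import math

def encode(text: str, dim: int = 16) -> list:
    # Stand-in encoder: hash character bigrams into a fixed-size unit vector.
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        vec[hash(text[i:i + 2]) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def neural_relevance(current_text: str, history_text: str) -> float:
    a, b = encode(current_text), encode(history_text)
    return sum(x * y for x, y in zip(a, b))  # cosine similarity of unit vectors
```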
It should be noted that the above only lists some example methods of computing the relevance (relevance score); the embodiments of this application place no restriction on how the relevance (relevance score) between voice commands is computed.
In one possible implementation, if no related historical voice command is obtained based on the historical voice command set, the voice assistant may initiate a dialogue with the user to obtain the missing part of the voice command. For example, the voice assistant generates prompt information for asking about the slot, or for guiding the user to give the slot, or for guiding the user to give a complete voice command; it converts the prompt information into audio data and outputs the prompt information through the audio module, so as to guide the user to give a complete voice command.
Step 8: The processing module in the voice assistant sends the obtained slot to the NLU module, or sends the completed voice command to the NLU module.
Optionally, if the related historical voice command obtained is text data, the NLU module parses the slot out of the historical voice command and adds the slot to the text data of the currently received voice command.
Optionally, if the related historical voice command obtained is structured data whose slot is described in natural language, the NLU module can obtain the slot directly from the structured data and add the slot to the structured data of the currently received voice command.
Optionally, if the related historical voice command obtained is structured data whose slot is a mapped underlying protocol parameter, for example already mapped to a device identifier, the NLU module can add the mapped underlying protocol parameter directly to the structured data of the currently received voice command. In this case the slot need not be mapped again in the subsequent steps, saving processing overhead.
Optionally, if the currently received voice command contains a demonstrative pronoun that refers to the slot, for example to the execution device, the NLU module can delete the demonstrative pronoun from the voice command after adding the slot to the structured data of the currently received voice command.
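A minimal sketch of this slot completion, assuming dict-shaped structured data and a small illustrative list of demonstrative pronouns:

```python
DEMONSTRATIVES = {"it", "that one", "this one"}  # illustrative pronoun list

def complete_command(current: dict, recovered_slots: dict) -> dict:
    """Fill missing slots from the related historical command, dropping pronoun placeholders."""
    completed = dict(current)
    slots = {k: v for k, v in completed.get("slots", {}).items()
             if v not in DEMONSTRATIVES}       # drop demonstrative placeholders first
    for name, value in recovered_slots.items():
        slots.setdefault(name, value)          # then fill only what is missing
    completed["slots"] = slots
    return completed
```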
Step 9: The NLU module in the voice assistant sends the structured data of the voice command (including the user intent and slot) to the execution module.
Optionally, if the slot obtained by the processing module has not been mapped to an underlying protocol parameter, the NLU module can map the slot to an underlying protocol parameter, for example to a device identifier, and send the mapped slot parameter to the execution module.
Step 11: After receiving the user intent and slot of the voice command, the execution module in the voice assistant performs the processing operations for executing the voice command.
For example, the execution module can call the corresponding application or service, or the corresponding function on this device, to execute the user intent; or it can generate, from the voice command (including the user intent and slot), a control instruction that the execution device can recognize and execute, and send the control instruction to the execution device.
For example, the control instruction may include generic command information, such as the command type and command parameters, and also the device information of the execution device.
Optionally, if the slot in the structured data of the voice command is already a mapped underlying protocol parameter, the execution module need not perform the mapping again; otherwise, the execution module needs to map the slot described in natural language to underlying protocol parameters.
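A sketch of this step under those assumptions follows; the field names and the registry lookup are illustrative, with DEVICE_REGISTRY being the hypothetical mapping sketched earlier:

```python
# Build a control instruction from the completed structured data,
# mapping the natural-language slot only when it has not been mapped yet.
def build_control_instruction(structured: dict, registry: dict) -> dict:
    slots = structured["slots"]
    device_id = slots.get("deviceID")          # already-mapped protocol parameter?
    if device_id is None:                      # otherwise map the natural-language slot
        device_id = registry[(slots["location"], slots["device"])]["deviceID"]
    return {
        "commandType": structured["intent"],   # generic command information
        "commandParams": slots.get("params", {}),
        "deviceID": device_id,                 # execution-device information
    }
```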
Steps 12a and 12b: The execution module in the voice assistant sends the execution result of the voice command to the processing module and the NLG module.
If the voice command was executed by this device, the execution module can obtain the execution result of the voice command directly; if the voice command was executed by another electronic device, the execution module can receive the execution result from the execution device of the voice command.
This flow is described taking successful execution as the example.
Step 13: The processing module in the voice assistant determines from the execution result that the voice command was executed successfully, and then stores the voice command, together with the associated information corresponding to the voice command, into the historical voice command set.
According to the data requirements for historical voice commands in the historical voice command set, the processing module can store the voice command into the historical voice command set in the required format. For the data requirements of historical voice commands in the historical voice command set, refer to the description above.
Step 14: The NLG module in the voice assistant converts the execution result into natural language, obtaining the text data of the execution result.
Step 15: The NLG module in the voice assistant sends the text data of the execution result to the TTS module.
Step 16: The TTS module in the voice assistant converts the text data of the execution result into audio data.
Step 17: The TTS module in the voice assistant sends the audio data of the execution result to the management module.
Step 18: The management module in the voice assistant sends the audio data of the execution result to the audio module (such as the speaker), so that the audio module outputs the corresponding speech to notify the user of the execution result of the voice command.
In another possible implementation, the processing module may also send the completed voice command to the execution module directly, without sending it to the NLU module.
It should be noted that if the voice assistant is deployed in the cloud, then after receiving the audio data of a voice command the electronic device sends the audio data to the cloud, where the voice assistant in the cloud performs the corresponding processing. Optionally, the electronic device may also send the corresponding associated information (such as the device information and location information of the electronic device) to the cloud, where this information can be cached so that the processing module can perform the corresponding processing operations based on it. Optionally, after the cloud receives the audio data of the voice command, the management module can obtain the corresponding user identity information based on the audio data, and can also obtain the account information of the user currently logged in to the voice assistant, and cache this information so that the processing module can perform the corresponding processing operations based on it.
In another possible implementation, if the voice assistant is deployed in the cloud, the voice assistant in the cloud may or may not include the execution module. If the voice assistant in the cloud includes the execution module, then optionally the execution module can convert the voice command into a control instruction recognizable by the execution device and send the control instruction to the electronic device (i.e. the receiving device of the voice command), so that the electronic device executes the control instruction (if the electronic device is the execution device of the voice command) or the electronic device forwards it to the execution device (if the electronic device is not the execution device of the voice command); of course, the cloud may also send the control instruction to the execution device directly, thereby enabling remote control. If the voice assistant in the cloud does not include the execution module, then optionally the cloud can send the completed voice command to the electronic device (i.e. the receiving device of the voice command), where the execution module in the receiving device performs the corresponding processing.
It should be noted that the flow shown in Figure 10 above is only an example, and the embodiments of this application do not limit the order of the steps in the flow; for example, step 2a may also be performed after step 2b.
It should also be noted that some steps in the flow shown in Figure 10 may be optional; for example, steps 12b to 18 may also be omitted.
According to the flow shown in Figure 10 above, in a scenario where the user uses a simplified voice command (for example one missing a slot), the missing part of the current voice command can be determined based on the historical voice command set, so that the current voice command is completed into a semantically complete voice command; this both ensures that the voice command can be executed and improves the user experience.
Refer to Figure 12, a schematic flowchart of another voice command processing method provided by an embodiment of this application. In this flow, for a current voice command missing a slot (such as the execution device), the voice assistant in the electronic device can request the cloud to obtain the historical voice command related to the voice command based on the historical voice command set; after obtaining the related historical voice command, the voice assistant adds the slot of that historical voice command to the current voice command, making the current voice command complete.
In the flow shown in Figure 12, some functions of the voice assistant are deployed in the electronic device and some in the cloud, i.e. the flow can be applied to the architecture shown in Figure 8b.
Refer to Figure 13, a schematic flowchart of another voice command processing method provided by an embodiment of this application. In this flow, for a current voice command missing a slot (such as the execution device), the voice assistant in the cloud can obtain the historical voice command related to the voice command based on the historical voice command set; after obtaining the related historical voice command, it adds the slot of that historical voice command to the current voice command, making the current voice command complete.
In the flow shown in Figure 13, some functions of the voice assistant are deployed in the electronic device and some in the cloud, i.e. the flow can be applied to the architecture shown in Figure 8c.
According to the flow shown in Figure 12 or Figure 13 above, in a scenario where the user uses a simplified voice command (for example one missing a slot), the missing part of the current voice command can be determined based on the historical voice command set, so that the current voice command is completed into a semantically complete voice command; this both ensures that the voice command can be executed and improves the user experience. Because the historical voice record set is located in the cloud and multiple electronic devices can share it, cross-device voice continuation can be achieved. In addition, having the cloud perform the voice continuation (i.e. querying the related historical voice records to obtain the missing part of the current voice command) can reduce the processing overhead on the terminal side.
It should be noted that Figures 10, 12 and 13 describe the voice command processing only for some possible deployments of the voice assistant; where the functions of the voice assistant are deployed in other ways, the processing flow of the voice command will be adjusted accordingly, which is not enumerated here.
Based on the above embodiments, several examples of voice continuation in different scenarios are shown below.
Example 1
The scenario of Example 1 is: the user first says to a mobile phone "play a movie on the living-room TV", and then says to a speaker "only action movies". After receiving the voice command "play a movie on the living-room TV", the phone sends a control instruction to the living-room TV, so that the living-room TV responds to the control instruction by displaying the movie home screen; after receiving the voice command "only action movies", the speaker sends a control instruction to the living-room TV, so that the living-room TV responds to the control instruction by selecting an action movie on the movie home screen and playing it.
Figure 14a shows one system architecture for this scenario: the phone and the speaker share the historical voice command set; in other words, the phone and the speaker each store a historical voice command set locally, and the phone and the speaker can interact to keep the historical voice command sets synchronized. Thus, after the speaker receives the voice command "only action movies", since the locally stored historical voice command set includes the voice command "play a movie on the living-room TV" previously received and processed by the phone, the voice assistant in the speaker can determine through relevance matching that the voice command "play a movie on the living-room TV" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the living-room TV.
Figure 14b shows another system architecture for this scenario: the phone and the speaker store no historical voice command set locally; the historical voice command set is stored in the cloud, and both the phone and the speaker can access it. Thus, after the speaker receives the voice command "only action movies", since the cloud historical voice command set includes the voice command "play a movie on the living-room TV" previously received and processed by the phone, the voice assistant in the speaker can determine through relevance matching that the voice command "play a movie on the living-room TV" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the living-room TV.
In Example 1, for the specific implementation of how the phone and the speaker process voice commands, refer to the foregoing embodiments.
Example 2
The scenario of Example 2 is: the user first says to a TV "play Andy Lau's movies", and then says to a tablet "only action movies". After the TV receives the voice command "play Andy Lau's movies", because the voice command contains no execution-device information, the voice assistant in the TV can choose a default execution device (for example, this TV as the default execution device) to respond to the voice command and display the list of movies starring Andy Lau; after receiving the voice command "only action movies", the tablet sends a control instruction to the TV, so that the TV responds to the control instruction by selecting an action movie in the movie list and playing it.
Figure 15a shows one system architecture for this scenario: the TV and the tablet share the historical voice command set. Thus, after the tablet receives the voice command "only action movies", since the locally stored historical voice command set includes the voice command "play Andy Lau's movies" previously received and processed by the TV, the voice assistant in the tablet can determine through relevance matching that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the same as that of the voice command "play Andy Lau's movies".
Figure 15b shows another system architecture for this scenario: the TV and the tablet store no historical voice command set locally; the historical voice command set is stored in the cloud, and both the TV and the tablet can access it. Thus, after the tablet receives the voice command "only action movies", since the cloud historical voice command set includes the voice command "play Andy Lau's movies" previously received and processed by the TV, the voice assistant in the tablet can determine through relevance matching that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the same as that of the voice command "play Andy Lau's movies".
Example 3
The scenario of Example 3 is: the user first says to a speaker "play Andy Lau's movies", and then says to a tablet "only action movies". After the speaker receives the voice command "play Andy Lau's movies", because the voice command contains no execution-device information, the voice assistant can choose a default execution device (for example, the living-room TV as the default execution device for video playback operations) to respond to the voice command, and the living-room TV displays the list of movies starring Andy Lau; after receiving the voice command "only action movies", the tablet sends a control instruction to the living-room TV, so that the living-room TV responds to the control instruction by selecting an action movie in the movie list and playing it.
In one system architecture for this scenario, the speaker and the tablet share the historical voice command set. Thus, after the tablet receives the voice command "only action movies", since the locally stored historical voice command set includes the relevant data of the earlier voice command "play Andy Lau's movies", the voice assistant in the tablet can determine through relevance matching that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the same as that of the voice command "play Andy Lau's movies".
In another system architecture for this scenario, the speaker and the tablet store no historical voice command set locally; the historical voice command set is stored in the cloud, and both the speaker and the tablet can access it. Thus, after the tablet receives the voice command "only action movies", since the cloud historical voice command set includes the earlier voice command "play Andy Lau's movies", the voice assistant in the tablet can determine through relevance matching that the voice command "play Andy Lau's movies" is related to the current voice command "only action movies", and thereby determine that the execution device of the current voice command "only action movies" is the same as that of the voice command "play Andy Lau's movies".
An embodiment of this application further provides a computer-readable storage medium configured to store a computer program; when the computer program is executed by a computer, the computer can implement the method provided by the above method embodiments.
An embodiment of this application further provides a computer program product configured to store a computer program; when the computer program is executed by a computer, the computer can implement the method provided by the above method embodiments.
An embodiment of this application further provides a chip including a processor coupled to a memory and configured to call the program in the memory so that the chip implements the method provided by the above method embodiments.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Therefore, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
Obviously, those skilled in the art can make various changes and variations to this application without departing from its scope. Thus, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application is also intended to encompass these changes and variations.

Claims (29)

  1. A voice command processing method, characterized in that the method comprises:
    obtaining a first voice command;
    determining the intent of the first voice command, and determining according to the intent of the first voice command that the first voice command is missing a slot;
    obtaining a second voice command from a historical voice command set, the second voice command being related to the first voice command;
    determining the slot of the first voice command according to the slot of the second voice command.
  2. The method according to claim 1, characterized in that determining the slot of the first voice command according to the slot of the second voice command comprises:
    the slot missing from the first voice command being provided by the corresponding slot of the second voice command.
  3. The method according to claim 1 or 2, characterized in that, after determining the slot of the first voice command according to the slot of the second voice command, the method further comprises:
    adding the slot of the second voice command to the first voice command;
    storing the first voice command with the slot added into the historical voice command set.
  4. The method according to claim 3, characterized in that adding the slot of the second voice command to the first voice command comprises:
    obtaining the slot in the structured data of the second voice command, the slot in the structured data of the second voice command being slot information expressed in natural language or being a protocol parameter, the protocol parameter being obtained by mapping the slot information expressed in natural language;
    adding the slot in the structured data of the second voice command to the structured data of the first voice command.
  5. The method according to claim 3 or 4, characterized in that the method further comprises:
    if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, deleting the demonstrative pronoun from the first voice command.
  6. The method according to any one of claims 1-5, characterized in that obtaining the second voice command from the historical voice command set comprises:
    obtaining, according to the relevance between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
  7. The method according to claim 6, characterized in that obtaining, according to the relevance between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set comprises:
    determining the relevance between the first voice command and each historical voice command in the historical voice command set according to the first voice command, the intent of the first voice command and/or the associated information corresponding to the first voice command, and each historical voice command in the historical voice command set, the intent of each historical voice command and/or the corresponding associated information; wherein the associated information corresponding to the first voice command is collected when the first voice command is received, and the associated information corresponding to a historical voice command is collected when that historical voice command is received;
    selecting, according to the relevance between the first voice command and each historical voice command in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
  8. The method according to any one of claims 1-5, characterized in that obtaining the second voice command from the historical voice command set comprises:
    a first electronic device sending a first request message to the cloud or a third electronic device, the first request message being used to request obtaining, from the historical voice command set, the voice command associated with the first voice command; wherein the first electronic device is the receiving device of the first voice command;
    the first electronic device receiving a first response message sent by the cloud or the third electronic device, the first response message carrying the second voice command, the second voice command being obtained from the historical voice command set according to the relevance between the first voice command and the historical voice commands in the historical voice command set.
  9. The method according to claim 8, characterized in that the first request message carries the first voice command, the intent of the first voice command and/or the associated information corresponding to the first voice command.
  10. The method according to claim 7 or 9, characterized in that the associated information corresponding to the first voice command includes at least one of the following:
    device information, the device information being information about the receiving device of the first voice command;
    user account information, the user account information being the account information of the user logged in to the voice assistant;
    location information, the location information being the location information of the receiving device of the first voice command;
    time information, the time information including the reception time of the first voice command and/or the time interval between the first voice command and the previously received voice command;
    user identity information, the user identity information being associated with feature information of the audio data of the first voice command.
  11. The method according to any one of claims 1-5, characterized in that obtaining the second voice command from the historical voice command set and determining the slot of the first voice command according to the slot of the second voice command comprise:
    the cloud obtaining, according to the relevance between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set;
    determining the slot of the first voice command according to the slot of the second voice command, the slot missing from the first voice command being provided by the corresponding slot of the second voice command.
  12. The method according to any one of claims 1-5, characterized in that:
    obtaining the first voice command comprises:
    the cloud converting the audio data of the first voice command from a first electronic device to obtain the corresponding text data;
    determining the intent of the first voice command, and determining according to the intent of the first voice command that the first voice command is missing a slot, comprises:
    the cloud parsing the text data to obtain the intent of the first voice command, and determining according to the intent of the first voice command that the first voice command is missing a slot;
    obtaining the second voice command from the historical voice command set and determining the slot of the first voice command according to the slot of the second voice command comprise:
    the cloud obtaining the second voice command from the historical voice command set, and determining the slot of the first voice command according to the slot of the second voice command.
  13. The method according to any one of claims 1-12, characterized in that the historical voice command set includes structured data of the historical voice commands, the structured data of a historical voice command including an intent and a slot.
  14. The method according to any one of claims 1-13, characterized in that the slot is the device or application or service that executes the intent of the voice command.
  15. A voice command processing system, characterized by comprising:
    an automatic speech recognition module, configured to convert the audio data of a first voice command into text data;
    a natural language understanding module, configured to parse the text data of the first voice command to obtain the intent of the first voice command;
    a processing module, configured to, if it is determined according to the intent of the first voice command that the first voice command is missing a slot, obtain a second voice command from a historical voice command set and determine the slot of the first voice command according to the slot of the second voice command; wherein the second voice command is related to the first voice command.
  16. The system according to claim 15, characterized in that the slot missing from the first voice command is provided by the corresponding slot of the second voice command.
  17. The system according to claim 15 or 16, characterized in that the processing module is further configured to:
    after determining the slot of the first voice command according to the slot of the second voice command, add the slot of the second voice command to the first voice command, and store the first voice command with the slot added into the historical voice command set.
  18. The system according to claim 17, characterized in that the processing module is specifically configured to:
    obtain the slot in the structured data of the second voice command, the slot in the structured data of the second voice command being slot information expressed in natural language or being a protocol parameter, the protocol parameter being obtained by mapping the slot information expressed in natural language;
    add the slot in the structured data of the second voice command to the structured data of the first voice command.
  19. The system according to claim 17 or 18, characterized in that the processing module is further configured to:
    if the first voice command includes a demonstrative pronoun used to indicate the slot of the first voice command, delete the demonstrative pronoun from the first voice command.
  20. The system according to any one of claims 15-19, characterized in that the processing module is specifically configured to:
    obtain, according to the relevance between the first voice command and the historical voice commands in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
  21. The system according to claim 20, characterized in that the processing module is specifically configured to:
    determine the relevance between the first voice command and each historical voice command in the historical voice command set according to the first voice command, the intent of the first voice command and/or the associated information corresponding to the first voice command, and each historical voice command in the historical voice command set, the intent of each historical voice command and/or the corresponding associated information; wherein the associated information corresponding to the first voice command is collected when the first voice command is received, and the associated information corresponding to a historical voice command is collected when that historical voice command is received;
    select, according to the relevance between the first voice command and each historical voice command in the historical voice command set, the second voice command related to the first voice command from the historical voice command set.
  22. The system according to claim 21, characterized in that the associated information corresponding to the first voice command includes at least one of the following:
    device information, the device information being information about the receiving device of the first voice command;
    user account information, the user account information being the account information of the user logged in to the voice assistant;
    location information, the location information being the location information of the receiving device of the first voice command;
    time information, the time information including the reception time of the first voice command and/or the time interval between the first voice command and the previously received voice command;
    user identity information, the user identity information being associated with feature information of the audio data of the first voice command.
  23. The system according to any one of claims 15-22, characterized in that the historical voice command set includes structured data of the historical voice commands, the structured data of a historical voice command including an intent and a slot.
  24. The system according to any one of claims 15-23, characterized in that the slot is the device or application or service that executes the intent of the voice command.
  25. The system according to any one of claims 15-24, characterized in that:
    the automatic speech recognition module, the natural language understanding module and the processing module are located in a first electronic device; or
    the automatic speech recognition module and the natural language understanding module are located in a first electronic device, and the processing module is located in the cloud or a third electronic device; or
    the automatic speech recognition module is located in a first electronic device, and the natural language understanding module and the processing module are located in the cloud; or
    the automatic speech recognition module, the natural language understanding module and the processing module are located in the cloud.
  26. The system according to any one of claims 15-25, characterized by further comprising:
    an execution module, configured to execute the first voice command according to the intent and slot of the first voice command, or to instruct the execution device of the first voice command to execute the first voice command, the execution device being provided by the slot of the first voice command.
  27. The system according to claim 26, characterized by further comprising: a natural language generation module and a text-to-speech module;
    the execution module being further configured to obtain the execution result of the first voice command;
    the natural language generation module being configured to convert the execution result of the first voice command into text data, the text data being natural language in text form;
    the text-to-speech module being configured to convert the text data into audio data.
  28. An electronic device, characterized by comprising: one or more processors; and one or more memories, wherein the one or more memories store one or more computer programs, the one or more computer programs include instructions, and when the instructions are executed by the one or more processors, the electronic device is caused to perform the method according to any one of claims 1-14.
  29. A computer-readable storage medium, characterized by comprising a computer program which, when run on an electronic device, causes the electronic device to perform the method according to any one of claims 1-14.
PCT/CN2023/104190 2022-07-01 2023-06-29 一种语音指令处理方法、装置、系统以及存储介质 WO2024002298A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210775676.2A CN117373445A (zh) 2022-07-01 2022-07-01 一种语音指令处理方法、装置、系统以及存储介质
CN202210775676.2 2022-07-01

Publications (1)

Publication Number Publication Date
WO2024002298A1 true WO2024002298A1 (zh) 2024-01-04

Family

ID=89383327

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/104190 WO2024002298A1 (zh) 2022-07-01 2023-06-29 一种语音指令处理方法、装置、系统以及存储介质

Country Status (2)

Country Link
CN (1) CN117373445A (zh)
WO (1) WO2024002298A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170228366A1 (en) * 2016-02-05 2017-08-10 Adobe Systems Incorporated Rule-based dialog state tracking
CN110111787A (zh) * 2019-04-30 2019-08-09 华为技术有限公司 一种语义解析方法及服务器
CN110136705A (zh) * 2019-04-10 2019-08-16 华为技术有限公司 一种人机交互的方法和电子设备
CN111274368A (zh) * 2020-01-07 2020-06-12 北京声智科技有限公司 槽位填充方法及装置
CN111723574A (zh) * 2020-07-09 2020-09-29 腾讯科技(深圳)有限公司 一种信息处理方法、装置及计算机可读存储介质
US20200410395A1 (en) * 2019-06-26 2020-12-31 Samsung Electronics Co., Ltd. System and method for complex task machine learning
CN112581944A (zh) * 2019-09-29 2021-03-30 北京安云世纪科技有限公司 一种语音指令响应方法、装置及终端设备
CN113270096A (zh) * 2021-05-13 2021-08-17 前海七剑科技(深圳)有限公司 语音响应方法、装置、电子设备及计算机可读存储介质


Also Published As

Publication number Publication date
CN117373445A (zh) 2024-01-09


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23830462

Country of ref document: EP

Kind code of ref document: A1