WO2022134110A1 - A speech understanding method and apparatus - Google Patents

A speech understanding method and apparatus

Info

Publication number
WO2022134110A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
slot
intent
voice
user
Prior art date
Application number
PCT/CN2020/139712
Other languages
English (en)
French (fr)
Inventor
苏琪 (Su Qi)
聂为然 (Nie Weiran)
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2020/139712 (WO2022134110A1)
Priority to EP20966673.4A (EP4250286A4)
Priority to CN202080004169.8A (CN112740323B)
Publication of WO2022134110A1


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/30 - Semantic analysis
    • G10L 2015/223 - Execution procedure of a spoken command

Definitions

  • the present invention relates to the technical field of natural language understanding, in particular to a speech understanding method.
  • Voice understanding systems have been widely used in various scenarios, such as smart phones, smart homes, smart cockpits, etc.
  • the voice understanding system can recognize the user's intent by understanding the user's voice input, so that the user can control related functions through voice commands.
  • a speech understanding system is usually connected to multiple applications, so that a user can control multiple functions of multiple applications through one speech understanding system.
  • in one existing scheme, a newly connected application defines a semantic expression template, and the speech understanding system builds a semantic understanding model dedicated to that template, thereby adapting the speech understanding system to the application.
  • however, the generalization ability of the semantic expression template under this scheme is very weak, and it is difficult to support users' flexible expressions.
  • in another existing scheme, user instruction data corresponding to the newly connected application needs to be collected in advance, and the semantic understanding model is retrained using the labeled data, so as to adapt the speech understanding system to the application.
  • under this scheme, each time a newly connected application is adapted, it is necessary to re-collect data, label the data, and retrain the model; the adaptation period is long, and the cost in manpower and material resources is high.
  • the embodiments of the present invention provide a speech understanding method, device, apparatus, computer storage medium, and computer program product.
  • the speech understanding method can support users' various flexible expressions and has stronger generalization ability; it can adapt to all registered applications, and when adapting a new application there is no need to re-collect data, label data, or retrain the model, so the adaptation period is short and the cost is low.
  • “intent” refers to an operation performed on data or resources, and can be named using verb-object phrases.
  • An “intent” may correspond to one or more "slots”.
  • the "book flight” intent corresponds to "departure time”, “departure place”, “destination”, “cabin type”, etc. slots.
  • “Slots” are used to store attributes of data or resources, such as “departure time”, “departure place”, “destination”, and "cabin type” of the ticket.
  • the "slot value” refers to the specific attribute of the data or resources stored in the "slot", for example, the "destination” of the ticket is "Beijing", and the "departure time” of the ticket is "tomorrow”.
  • a first aspect of the embodiments of the present invention provides a speech understanding method, including: obtaining first voice information; matching the intent of the first voice information according to intent information in a comprehensive semantic library, and identifying the slot value in the first voice information according to slot information in the comprehensive semantic library, where the intent information and slot information in the comprehensive semantic library come from multiple registered applications; determining a target application according to the intent of the first voice information; and transmitting the intent of the first voice information and the slot value in the first voice information to the target application.
  • the registered application is an application adapted to the speech understanding system.
  • the registered application may be an application installed in a terminal device and adapted to the speech understanding system.
  • the speech understanding system stores the received intent information and slot information in the comprehensive semantic library.
  • the semantic understanding system can directly save the received information, item by item, in the comprehensive semantic library; that is, the comprehensive semantic library can contain a data packet corresponding to each registered application, and each data packet includes the intent information and slot information required by the corresponding application to execute services.
  • the semantic understanding system can also integrate the information after receiving it from the applications, merging identical intent information and slot information; in this case, the intent information and slot information in the comprehensive semantic library are not stored per application, but a lookup table indicates the correspondence between each application and its intent information and slot information.
  • the user usually expresses one or more intentions through voice input, where the intention is to instruct the target application to execute a service corresponding to the intention.
  • the embodiment of the present invention supports any natural language expression that conforms to the user's habits. For example, when the user wants to express the intention of "booking a flight ticket", the user may adopt a relatively standardized, formatted expression such as "book a flight ticket to Beijing for me", a simpler and less informative expression such as "I'm going to Beijing", or a keyword expression such as "book air tickets" or "book tickets".
  • the embodiments of the present invention also include expressions in other manners.
  • when an application connects to the speech understanding system, it can send a registration request to the speech understanding system, where the registration request includes the application's identifier and a data packet, and the data packet includes the intent information and slot information required by the application to perform services.
  • the intent information and slot information in the data packet specifically refer to the description information of the intent and the description information of the slot.
  • the description information of the intent allows the application to use flexible description methods.
  • for example, the description information of the intent can be a relatively formatted description such as "book a ticket to XX city on XX date", or a keyword-based description such as "book air tickets".
  • the description information of the slot likewise allows the application to use flexible description methods.
  • for example, the description information of the slot can be an attribute-style description such as "the name of a city", or a keyword-style description such as "arrival place" or "destination".
  • the slot information required by the application to execute a specific service includes option slots and fill-in slots, where an option slot is a slot whose value comes from a predefined finite set, and a fill-in slot is a slot whose value is not predefined; the value of a fill-in slot may come directly from a fragment of the user instruction (extraction), or may be generated by the speech understanding system according to the user instruction (generation).
  • the option slot can be "judging whether the ticket is a direct ticket", and the slot value corresponding to the option slot can include two possible slot values, "Yes" and "No".
  • the fill-in slot can be "the destination of the ticket", which is an extractable fill-in slot, and the slot value can be the place names such as "Beijing" and "Shanghai" included in the user's voice input.
  • the intent information and slot information in the data packet in the registration request can be stored in a comprehensive semantic library.
  • the comprehensive semantic library contains the data packets of one or more applications, and applications and data packets can be in a one-to-one correspondence; that is, the data packet corresponding to an application contains the intent information and slot information required by that application to perform services.
  • the comprehensive semantic library is continuously updated: every time a new application is connected, the data packet corresponding to the newly connected application is saved into the comprehensive semantic library, that is, the library is updated once, so that it always contains the data packets of all registered applications.
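  • The following sketch illustrates, under assumed field names and layout, how such a comprehensive semantic library might store registration requests, combining the two storage modes described above: per-application data packets plus a lookup table from intent to the applications that provide it.

```python
class SemanticLibrary:
    """Comprehensive semantic library: keeps each registered application's data
    packet and a lookup table from intent name to the applications providing it."""

    def __init__(self):
        self.packets = {}       # app_id -> data packet (intent + slot descriptions)
        self.intent_index = {}  # intent name -> [app_id, ...]

    def register(self, app_id: str, packet: dict) -> None:
        """Handle a registration request: store the packet and update the index,
        so the library always covers all registered applications."""
        self.packets[app_id] = packet
        for intent in packet["intents"]:
            self.intent_index.setdefault(intent["name"], []).append(app_id)

lib = SemanticLibrary()
lib.register("ticket_app", {
    "intents": [{
        "name": "book flight ticket",
        "description": "book a ticket to XX city on XX date",
        "slots": [
            {"name": "destination", "description": "the name of a city",
             "kind": "fill-in"},                          # value extracted/generated
            {"name": "direct flight", "description": "whether the ticket is direct",
             "kind": "option", "values": ["Yes", "No"]},  # predefined finite set
        ],
    }],
})
```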
  • the voice input can be converted into text content through voice recognition, which is convenient for subsequent processing.
  • the text content corresponding to the user's voice input is input into the semantic understanding model together with the intent information and slot information in the comprehensive semantic library; the semantic understanding model identifies the intent of the first voice input according to the intent information in the comprehensive semantic library, and identifies the slot value in the first voice input according to the slot information in the comprehensive semantic library.
  • the intent and the slot value in the user's voice instruction are transmitted to the corresponding target application.
  • the "transfer" can be implemented in various ways.
  • when the speech understanding system and the business application share a central processing unit (CPU, Central Processing Unit), the transfer of the intent and slot value can be realized through program calls.
  • when the functions of the speech understanding system and the business application are executed by different processors, for example, when the speech understanding system is arranged in the cloud or on a separate device, the transfer of the intent and slot value can be realized through network communication.
  • the semantic understanding model in the embodiments of the present invention can identify the intent and the slot value from voice input expressed in various ways; therefore, the voice understanding method provided in the embodiments of the present invention can support the user's various flexible modes of expression and has the advantage of strong generalization ability.
  • the comprehensive semantic library always contains the intent information and slot information of all registered applications.
  • the semantic understanding model can judge whether the user's voice input contains a corresponding intent and slot values according to the intent information and slot information of each registered application; therefore, the voice understanding method provided by the embodiments of the present invention can support voice control of all registered applications, achieves "plug and play" of new applications, shortens the adaptation period between the voice understanding system and a new application, and reduces the cost of adapting new applications.
  • in a possible implementation, the speech understanding method further includes: displaying a first interface of the target application according to the intent of the first voice information and the slot value in the first voice information.
  • the first interface may be the main interface of the target application, that is, after the target application corresponding to the first voice information is determined, the main interface of the target application is displayed for the user to perform subsequent operations.
  • the first interface may also display all optional items corresponding to the intent in the first voice information and satisfying the conditions of the first voice information for the user to select. For example, when the first voice message expresses the intention of "booking a flight ticket", all the flight tickets that match the slot value in the first voice message may be displayed on the first interface for the user to make further selection.
  • by displaying the first interface of the target application, the result of the speech understanding can be shown to the user, giving the user the possibility of further operation or confirmation.
  • the first interface also lets the user confirm the correctness of the speech understanding, which improves the visibility and accuracy of speech understanding.
  • in a possible implementation, after the intent of the first voice information and the slot value in the first voice information are transmitted to the target application, the speech understanding method further includes: outputting a result of the target application performing a target operation, where the target operation is an operation determined by the intent of the first voice information and the slot value in the first voice information.
  • the result of the target application performing the target operation can be output in various ways.
  • for example, the result of the target application performing the target operation can be shown on a screen; it can also be indicated by sound prompts, with one sound specified to indicate successful execution and another to indicate failure; feedback can also be given by voice broadcast, directly announcing the execution result.
  • by obtaining feedback from the target application on executing the corresponding service and relaying the execution result to the user, the voice understanding system lets users know whether their voice commands were correctly recognized and executed; when execution fails, the user can retry or switch to another method, avoiding the trouble caused by mistakenly believing that a failed command succeeded.
  • in a possible implementation, multiple candidate applications are determined according to the intent of the first voice information, and the target application is determined according to the respective user usage frequencies or scores of the multiple candidate applications.
  • the candidate applications can be determined in various ways: for example, the application corresponding to the intent of the first voice information can be found by looking up a table, or by searching the data packets of each application one by one.
  • the score may be the user's own rating of the target application, or network users' ratings of the target application. When determining the target application according to the user's usage frequency or rating, only the usage frequency may be considered, or only the rating, or the usage frequency and the rating may be considered together.
  • a speech understanding system may be connected to multiple applications that perform the same or similar business functions; in this case, the same intent may correspond to multiple target applications.
  • the intent of "booking a flight ticket" may correspond to multiple target applications such as Ctrip, Fliggy, and Meituan at the same time.
  • multiple candidate applications can be sorted according to the user's usage frequency or user ratings, the top-ranked application is selected as the target application corresponding to the user's first voice command, and the intent and slot value in the user's voice input are sent to that target application, which then executes the corresponding service.
  • this method selects the target application according to the user's usage habits or favorable ratings, which better fits the user's actual needs and can greatly improve the user experience.
  • the target application can also be determined jointly according to the intent and the slot value in the user's voice input; furthermore, the user's usage frequency and ratings can be consulted on top of the intent and slot value to determine the target application together.
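  • A minimal sketch of the candidate-application ranking described above, assuming a weighted combination of usage frequency and rating; the weights and data are illustrative, not values from the patent.

```python
def pick_target_app(candidates, usage_freq, ratings, w_freq=0.7, w_rating=0.3):
    """Rank candidate applications that share the same intent by a weighted mix
    of the user's usage frequency and the (user or network) rating."""
    def score(app):
        return w_freq * usage_freq.get(app, 0) + w_rating * ratings.get(app, 0)
    return max(candidates, key=score)

# e.g. three applications that can all serve the "book flight ticket" intent:
target = pick_target_app(
    ["Ctrip", "Fliggy", "Meituan"],
    usage_freq={"Ctrip": 12, "Fliggy": 3},   # uses per month, say
    ratings={"Ctrip": 4.5, "Fliggy": 4.8, "Meituan": 4.2},
)
```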
  • in a possible implementation, the speech understanding method further includes: receiving feedback information from the target application, where the feedback information includes the slot information missing from the first voice information; and requesting the user to input the slot information missing from the first voice information.
  • in this embodiment, the method receives the target application's feedback about the missing slot information and asks the user to continue inputting the missing slot value; by supplementing the missing slot value in a second round, the success rate of the target application executing the specific service, and thus of executing the user's voice command, can be improved.
  • One piece of intent information in the comprehensive semantic library corresponds to at least one slot package, and one slot package includes one or more slot information.
  • when there is a correspondence between the intent information and the slot information in the comprehensive semantic library, that is, when one intent corresponds to one or more slot packages, then after the intent in the first voice information is determined, the slot value in the first voice information only needs to be identified according to the slot information in the slot package corresponding to that intent, which reduces the system's workload and improves the efficiency of voice recognition.
  • a slot package contains one or more mandatory slot information, and the mandatory slot information is the slot information necessary to execute the intent corresponding to the intent information;
  • when the number of slot values identified in the first voice information is less than the number of mandatory slot information in the target slot package, the user is requested to input the slot information missing from the first voice information, where the target slot package is the slot package corresponding to the intent of the first voice information.
  • the required slot information refers to the slot information necessary for the application to execute a specific business.
  • taking the "book a flight ticket" service as an example, the mandatory slots for executing the service can include "departure time", "departure place", "destination", etc.; in the absence of any one of the required slot values, the target application cannot perform the specific service normally.
  • the optional slot information refers to additional slot information used when the application executes a specific service. Again taking the "book a flight ticket" service as an example, the optional slots for executing the service can be "cabin type", "seat preference", etc.; in the absence of the optional slot values, the target application can still perform the specific service normally.
  • when an intent corresponds to a slot package, and the slot package contains one or more required slots and one or more optional slots, the speech understanding system can identify the intent and slot values in the user's first voice input and determine whether the first voice input contains all the required slot values corresponding to the user's intent.
  • if the user's first voice input contains all the required slot values corresponding to the user's intention, the intention and slot values in the user's first voice input are directly transmitted to the corresponding target application.
  • if the user's first voice input does not contain all the required slot values corresponding to the user's intention, then, as described in the first and second implementation manners of the first aspect of the embodiments of the present invention, the user is asked to input the missing slot value, and subsequent operations, such as transmitting the newly acquired slot value in the user's voice input to the target application, are performed.
  • in this way, the speech understanding system can directly identify the missing slot information, ask the user to supplement it, and send the complete intent and slot values to the corresponding target application in one pass, which reduces the number of interactions between the target application and the speech understanding system, shortens the time needed for speech understanding, improves its efficiency, and greatly improves the user experience.
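  • A sketch of the mandatory-slot completeness check described above, assuming the slot package is a simple mapping with "mandatory" and "optional" lists (an illustrative layout, not the patent's schema).

```python
def missing_mandatory_slots(slot_package: dict, recognized: dict) -> list:
    """Compare the slot values recognized in the first voice input against the
    mandatory slots of the slot package bound to the recognized intent."""
    return [s for s in slot_package["mandatory"] if s not in recognized]

package = {"mandatory": ["departure time", "departure place", "destination"],
           "optional": ["cabin type", "seat preference"]}
recognized = {"destination": "Beijing"}   # e.g. from "I'm going to Beijing"
print(missing_mandatory_slots(package, recognized))
# -> ['departure time', 'departure place']; these are requested from the user
#    before the complete intent and slot values are sent to the target app.
```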
  • in a possible implementation, the speech understanding method further includes: obtaining second voice information; identifying the slot value in the second voice information according to the missing slot information; and transmitting the slot value in the second voice information to the target application.
  • in this embodiment, the method asks the user to continue inputting the missing slot value, obtains the user's second voice input, and transmits the slot value in the second voice input to the corresponding target application, so that all the slot values required by the target application to perform the specific service are satisfied; this improves the success rate of the target application executing the specific service and of executing the user's voice command.
  • in the second voice input, the user can express only the missing mandatory slot value without repeating the first voice input, which saves the user from repeatedly inputting instructions and can greatly improve the user experience.
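  • Building on the previous sketch, the re-prompt loop for missing mandatory slot values might look as follows; ask_user, identify_slots, and send are placeholders for the system's own I/O and recognition functions, not interfaces specified by the patent.

```python
def complete_slots(intent, recognized, slot_package, ask_user, identify_slots, send):
    """Re-prompt loop: request only the missing mandatory slot values, recognize
    them in the user's second voice input, then transmit the complete intent and
    slot values to the target application in one pass."""
    missing = [s for s in slot_package["mandatory"] if s not in recognized]
    while missing:
        # The user may answer with just the missing values, e.g. "Beijing",
        # without repeating the first instruction.
        second_input = ask_user("Please provide: " + ", ".join(missing))
        recognized.update(identify_slots(second_input, missing))
        missing = [s for s in slot_package["mandatory"] if s not in recognized]
    send(intent, recognized)
```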
  • a second aspect of the embodiments of the present invention provides a speech understanding device, including:
  • a communication module for receiving the first voice information
  • the data storage module is used to store intention information and slot information, and the intention information and slot information come from multiple registered applications;
  • the processing module is configured to match the intent of the first voice information according to the intent information in the data storage module, determine the target application according to the intent of the first voice information, and identify the slot value in the first voice information according to the slot information in the data storage module;
  • the communication module is further configured to send the intent of the first voice information and the slot value in the first voice information.
  • the speech understanding device provided in the second aspect of the embodiment of the present invention can be a virtual device and can be arranged in the cloud, and the speech understanding device can communicate with multiple end-side devices and assist multiple end-side devices to complete the task of speech understanding.
  • the terminal-side device can receive the user's voice information, and the intent information and slot information transmitted by the registered application installed in the terminal-side device.
  • the cloud-side speech understanding device collects intent information and slot information from multiple terminal-side devices, and stores them in the data storage module.
  • when the terminal-side device needs to perform speech understanding, it sends the voice information to the cloud-side device; the processing module of the cloud-side speech understanding device performs the speech understanding operation and sends only the intent of the voice information and the slot value in the voice information back to the terminal-side device, and the terminal-side device interacts with the target application to complete the subsequent operations.
  • the processing module determines the target application according to the intent of the first voice information, specifically:
  • the processing module determines a plurality of candidate applications according to the intention of the first voice information, and determines a target application according to respective user usage frequencies or scores of the plurality of candidate applications.
  • in a possible implementation, one piece of intent information corresponds to at least one slot package, and one slot package includes one or more pieces of slot information; identifying the slot value in the first voice information according to the slot information specifically includes: identifying the slot value in the first voice information according to the slot information in the slot package corresponding to the intent of the first voice information.
  • a slot package contains one or more mandatory slot information, and the mandatory slot information is the slot information necessary to execute the intent corresponding to the intent information;
  • when the number of slot values identified in the first voice information is less than the number of mandatory slot information in the target slot package, the communication module is further configured to send the slot information missing from the first voice information, where the target slot package is the slot package corresponding to the intent of the first voice information.
  • for example, the speech understanding system sends the missing slot information to the terminal-side device, and the terminal-side device interacts with the user and requests the user to supplement the missing slot information.
  • after the communication module sends the slot information missing from the first voice information to the second device, the communication module is further configured to obtain second voice information;
  • the processing module is also used to identify the slot value in the second voice information according to the missing slot information
  • the communication module is further configured to send the slot value in the second voice information.
  • a third aspect of the embodiments of the present invention provides a speech understanding device, including:
  • a microphone for collecting the first voice information
  • the memory is used to store intention information and slot information, and the intention information and slot information come from multiple registered applications;
  • a processor configured to match the intention of the first voice information according to the intention information, determine the target application according to the intention of the first voice information, and identify the slot value in the first voice information according to the slot information;
  • the processor is further configured to transmit the intent of the first voice information and the slot value in the first voice information to the target application.
  • the speech understanding device further includes a display screen
  • the processor is configured to instruct the display screen to display the first interface of the target application according to the intention of the first voice information and the slot value in the first voice information.
  • the speech understanding device also includes an output device, configured to output, according to the processor's instruction, the result of the target application performing the target operation after the processor provides the intent of the first voice information and the slot value in the first voice information to the target application, where the target operation is an operation determined by the intent of the first voice information and the slot value in the first voice information.
  • the output device may be any device that can output feedback to the user, such as a speaker, a display screen, or the like.
  • the processor determines the target application according to the intent of the first voice information, specifically:
  • the processor determines a plurality of candidate applications according to the intent of the first voice information
  • the processor determines the target application according to the respective user usage frequencies or scores of the multiple candidate applications.
  • the processor is further configured to receive feedback information of the target application, where the feedback information includes slot information missing from the first voice information;
  • the speech understanding device further includes an output device for outputting a first request, where the first request is used for requesting the user to input slot information missing in the first speech information.
  • in a possible implementation, one piece of intent information corresponds to at least one slot package, and one slot package includes one or more pieces of slot information; identifying the slot value in the first voice information according to the slot information specifically includes: identifying the slot value in the first voice information according to the slot information in the slot package corresponding to the intent of the first voice information.
  • a slot package contains one or more mandatory slot information, and the mandatory slot information is the slot information necessary to execute the intent corresponding to the intent information;
  • the voice understanding device further includes an output device, and when the number of slot values identified from the first voice information is less than the number of required slot information in the target slot package, the output device is used to request the user to input the slot information missing from the first voice information, where the target slot package is the slot package corresponding to the intent of the first voice information.
  • after the user is requested to input the missing slot information in the first voice information, the microphone is further used to collect the second voice information;
  • the processor is also used to identify the slot value in the second voice information according to the missing slot information
  • the processor is further configured to transmit the slot value in the second voice information to the target application.
  • a fourth aspect of the embodiments of the present invention provides a computer storage medium including computer instructions; when the computer instructions are run on a computer, the computer can execute the method of the first aspect or any one of the first seven possible implementations of the first aspect, and can achieve all of the above beneficial effects.
  • a fifth aspect of the embodiments of the present invention provides a computer program product; when the computer program product runs on a computer, the computer can execute the method of the first aspect or any one of the first seven possible implementations of the first aspect, and can achieve all of the above beneficial effects.
  • FIG. 1 is a speech understanding method provided by an embodiment of the present invention.
  • FIG. 3 is a form of a data packet provided by an embodiment of the present invention.
  • FIG. 4 is a form of another data packet provided by an embodiment of the present invention.
  • FIG. 5 is a form of another data packet provided by an embodiment of the present invention.
  • FIG. 6 is a manner of performing semantic understanding in a semantic understanding model provided by an embodiment of the present invention.
  • FIG. 8 is an example implementation of the example method according to an embodiment of the present invention.
  • FIG. 9 is another example implementation of the example method according to an embodiment of the present invention.
  • FIG. 11 is a speech understanding device provided by an embodiment of the present invention.
  • FIG. 12 is another speech understanding device provided by an embodiment of the present invention.
  • the embodiments of the present invention can be applied to various scenarios in which an application is controlled by voice input to perform related functions.
  • the user controls the application software on the smart terminal device to perform the corresponding function through the voice input
  • the user controls the household appliance to execute the corresponding function through the voice input indoors
  • the user controls the functions of the hardware devices in the cockpit, or the functions of the multimedia system, through voice input in the car, etc.
  • a speech understanding system is usually connected to multiple applications, so that a user can control multiple functions through a speech understanding system.
  • the interfaces of applications in different vertical domains are very different.
  • the embodiments of the present application provide a speech understanding method, which can be applied to an intelligent terminal, can also be arranged in a cloud processor, and can also be applied to an independent speech understanding device.
  • the accessing application first sends a registration request to the speech understanding system, and the registration request includes a data packet, and the data packet includes the intent information and slot information required by the application to perform services.
  • the speech understanding system stores the data packet from the application's registration request in the comprehensive semantic library, and recognizes the intent and slot value in the user's voice input according to the intent information and slot information in the data packet provided by the application, so that the speech understanding system can flexibly adapt to various connected applications, with the advantages of a short adaptation period and low cost.
  • the speech understanding model supports various flexible expressions of users and has stronger speech understanding ability.
  • a speech understanding method 100 is provided, and the method includes the following steps.
  • Step 110 Obtain the first voice input.
  • the user's voice input usually expresses an intention, and the intention is to instruct the target application to perform a service corresponding to the intention.
  • the embodiment of the present invention supports any natural language expression that conforms to the user's habits. For example, when the user wants to express the intention of "booking a flight ticket", the user may adopt a relatively standard expression such as "book a flight ticket to Beijing for me", a simpler expression such as "I'm going to Beijing", or a keyword expression such as "book air tickets" or "book tickets".
  • the embodiments of the present invention also include expressions in other manners.
  • step 110 further includes a speech recognition step for converting the user's speech input into text for subsequent processing.
  • Step 120 Identify the intent of the first voice input according to the intent information in the integrated semantic database, and identify the slot value in the first voice input according to the slot information in the integrated semantic database.
  • the intent information and slot information in the comprehensive semantic library come from the data packets in the applications' registration requests; that is, when an application is adapted to the speech understanding system, it needs to send a registration request to the speech understanding system, where the registration request includes the identifier of the application and a data packet containing the intent information and slot information required by the application to execute services.
  • the specific form of the data packet will be described in detail below.
  • after receiving the registration request of a newly connected application, the speech understanding system saves the data packet from the registration request in the comprehensive semantic library for subsequent use in semantic recognition.
  • in one implementation, the comprehensive semantic library may contain the data packets of one or more applications, wherein applications and data packets may be in one-to-one correspondence; that is, each application corresponds to a data packet containing the intent information and slot information required by that application to execute services.
  • in another implementation, the comprehensive semantic library may organize and merge the intent and slot information from the received application registration request packets; for example, applications of the same type may have the same intent, and such identical intents can be merged and stored in the comprehensive semantic library, in which case the concept of per-application data packets no longer exists in the library.
  • the comprehensive semantic library is continuously updated: every time a new application is connected, the data packet corresponding to the newly connected application is saved into the comprehensive semantic library, that is, the library is updated once, so that the comprehensive semantic library always contains the data packets of all connected applications.
  • identifying the intent in the user's first voice input according to the intent information in the comprehensive semantic library, and identifying the slot value in the user's first voice input according to the slot information in the comprehensive semantic library, can be executed by the semantic understanding model; the specific process of semantic understanding performed by the semantic understanding model is described in detail below.
  • Step 130 Determine the target application according to the intent of the first voice input.
  • a speech understanding system may simultaneously access multiple applications that perform the same or similar service functions, and in this case, the same intent may correspond to multiple target applications.
  • the intent of "booking a flight ticket" may correspond to multiple target applications such as Ctrip, Fliggy, and Meituan at the same time.
  • in one implementation manner, multiple target applications may be sorted according to the user's frequency of use; in another implementation manner, they may be sorted according to user ratings; and in yet another implementation manner, they may be ranked considering both the user's usage frequency and user ratings. It should be understood that other possible ways of ordering the target applications are also possible.
  • the target application ranked first can be selected as the target application corresponding to the user's first voice command, and the intent and slot value in the user's voice input are sent to the target application, and the corresponding service is executed by the target application.
  • the slot value included in the user's voice input may also be taken into consideration, and the target application is determined on the basis of comprehensively considering the intent and slot value in the user's voice input.
  • if the intent of the first voice input cannot be matched to any registered application, feedback of speech understanding failure is output.
  • the feedback can be in any possible manner, for example, a warning of speech comprehension failure can be issued by means of screen display or voice broadcast, or the user can be required to re-enter voice input by means of screen display or voice broadcast. It should be understood that other feedback methods that can make the user aware of the speech recognition failure are also possible.
  • the interface of the target application may be displayed.
  • the main interface of the target application may be displayed for the user to operate.
  • an interface corresponding to the intent and slot value of the first voice message may also be displayed; for example, when the user's intent is "book air tickets", all eligible air tickets can be displayed for the user to select from.
  • Step 140 Send the intent and slot value in the first voice input to the target application.
  • the intent and the slot value in the user's voice instruction are transmitted to the corresponding target application.
  • the speech understanding system and the business application may share a central processing unit (CPU, Central Processing Unit).
  • for example, when the speech understanding system and the adapted application are both on a smart terminal (for example, the Siri voice assistant), the transfer of the intent and slot value can be achieved through program calls.
  • the speech understanding system can be arranged on the cloud side, and the application can be arranged on the terminal side. In this case, the intent and slot value can be transmitted through network communication.
  • in another implementation, the functions of the speech understanding system and the business application are executed by different processors; for example, the speech understanding system may run on a separate device used to control other peripheral devices (for example, the Xiaodu artificial intelligence assistant), in which case the transmission of the intent and slot value can be realized through network communication.
  • other arrangements of speech understanding systems and applications are also possible, and correspondingly, other implementations capable of transmitting the intent and slot value in the user's speech command to the corresponding target application are also possible.
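  • As a sketch of the two transfer arrangements above (shared processor via program call, separate processors via network communication), assuming a JSON payload and an illustrative endpoint that are not specified by the patent:

```python
import json
import urllib.request

def dispatch(intent: str, slots: dict, target, via_network: bool = False):
    """Transmit the recognized intent and slot values to the target application.
    With a shared processor, a plain program call suffices; with the speech
    understanding system in the cloud or on a separate device, the same payload
    travels over the network instead."""
    payload = {"intent": intent, "slots": slots}
    if via_network:
        req = urllib.request.Request(
            target,  # e.g. "http://device.local/voice-command" (illustrative)
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        return urllib.request.urlopen(req)
    return target(payload)  # local case: target is the application's handler
```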
  • the speech understanding method 100 further includes the following steps.
  • Step 150 The user is asked to input the missing slot value.
  • when the user's first voice input lacks a mandatory slot value, the user is required to continue to input the missing slot value.
  • the speech understanding system may ask the user to continue entering the missing slot value in any possible way.
  • the user may be required to input the missing slot value through the screen display, for example, it may be displayed on the screen: please input the slot value of the XX slot.
  • the user may be required to input the missing slot value by means of voice broadcast, for example, using speech synthesis (TTS, Text to Speech) to convert the instruction requiring the user to input the missing slot value into a voice output broadcast.
  • Step 160 Obtain the second voice input.
  • the user's second voice input may only include the missing slot value, for example, using keyword expressions such as "Beijing" and "Shanghai” to supplement the slot value of the "destination" slot.
  • the user's second voice input may include the first voice input plus the missing slot value in the first voice input; for example, if the first voice input only expresses "Book me a flight ticket for tomorrow", the second voice input can be "help me book a ticket to Beijing tomorrow".
  • Step 170 Send the slot value in the second voice input to the target application.
  • the implementation manner of transmitting the slot value in the second voice input to the target application in step 170 is the same as that in step 140, and in order to avoid repetition, details are not repeated here.
  • Step 180 Receive feedback that the target application executes the corresponding service.
  • the speech understanding system and the business application may share a central processing unit (CPU).
  • the speech understanding system and the matched application are both on a smart terminal (for example, the Siri voice assistant).
  • Feedback of the target application executing the corresponding service can be received by means of program calling.
  • the speech understanding system can be arranged on the cloud side, and the application can be arranged on the terminal side, and at this time, the feedback of the target application executing the corresponding service can be received through network communication.
  • in another implementation, the functions of the speech understanding system and the business application may be executed by different processors; for example, the speech understanding system may run on a separate device used to control other peripheral devices (for example, the Xiaodu artificial intelligence assistant), in which case the feedback of the target application executing the corresponding service can be received through network communication.
  • Step 190 Feed back to the user the execution result of the target application executing the corresponding service.
  • the speech understanding system can feed back the execution result of the target application executing the corresponding service to the user in any possible manner.
  • the execution result of the target application executing the corresponding service can be fed back to the user through the screen display, for example, it can be displayed on the screen that the flight booking is successful.
  • the execution result of the target application executing the corresponding service can be fed back to the user by means of voice broadcast, for example: using speech synthesis to convert the text of "air ticket booking is successful" into voice output for broadcast. It should be understood that other implementation manners that can implement feedback of the execution result of the target application executing the corresponding service to the user are all possible.
  • the intent information and slot information of the application in the integrated semantic library come from the data packet in the application's registration request, and the intent information and slot information in the data packet provided by the application specifically refer to the description information of the intent and the description information of the slot.
  • the description information of the intent supports a flexible description method.
  • in one implementation, a relatively formatted description method can be used; for example, the description information for the intent of "booking a flight ticket" can be "book a flight ticket to XX city on XX date", etc.
  • a keyword-based description method may also be used, for example, the description information for the intent of "booking a flight ticket” may be a description method such as "booking a flight ticket”.
  • the description information of the slot also supports a flexible description method.
  • in one implementation, an attribute-like description method can be used; for example, the description information of the "destination" slot can be "the name of a city", etc.
  • in another implementation manner, a keyword-based description method may also be used; for example, the description information for the "destination" slot may be "arrival place" or "destination". It should be understood that other possible ways of describing intents and slots are also possible.
  • the slot information required by the application to execute a specific service includes option slots and fill-in slots, where an option slot is a slot whose value comes from a predefined finite set, and a fill-in slot is a slot whose value is not predefined; the value of a fill-in slot may come directly from a fragment of the user instruction (extraction), or may be generated by the semantic understanding system according to the user instruction (generation).
  • the option slot can be "judging whether the ticket is a direct ticket", and the slot value corresponding to the option slot can include two possible slot values, "Yes" and "No".
  • the fill-in slot can be "the destination of the ticket", which is an extractable fill-in slot, and the slot value can be the place names such as "Beijing" and "Shanghai” included in the user's voice input.
  • the intent information and slot information in the data packet provided by the application may have different storage forms.
  • the corresponding relationship between the intent and the slot may be specified in the data packet provided by the application.
  • the correspondence between intents and slots may be that one intent corresponds to one slot package, where the slot package includes one or more pieces of slot information required by the application to execute the service corresponding to the intent.
  • a slot packet corresponding to an intent may include description information of one or more mandatory slots and description information of one or more optional slots.
  • Required slot information refers to the slot information necessary for the application to execute a specific business.
  • taking the "book a flight ticket" service as an example, the mandatory slots for executing the service can include "departure time", "departure place", "destination", etc.; in the absence of any one of the required slot values, the target application cannot perform the specific service normally.
  • the optional slot information refers to additional slot information used when the application executes a specific service. Again taking the "book a flight ticket" service as an example, the optional slots can be "cabin type", "seat preference", etc.; in the absence of the optional slot values, the target application can still perform the specific service normally.
  • in this way, the speech understanding system can first identify the intent in the user's voice input according to the intent information in the comprehensive semantic library; after recognizing the intent, it does not need to check all the slot information in the data packet, but only needs to check, for the slot information corresponding to that intent, whether the user's voice command contains the corresponding slot values.
  • the voice understanding system can thus determine whether the user's voice input contains all the slot values corresponding to the required slots of the service to be performed by the target application. In one implementation manner, when the user's voice input contains all these required slot values, the intent and slot values in the user's voice input can be directly sent to the target application; in another implementation manner, when it does not, steps 150, 160, and 170 are performed.
  • the data packet provided by the application only contains one or more intent information and one or more slot information. According to the provided data packet, it is impossible to determine which slot values are required by the application to execute a service corresponding to a certain intent.
  • when the intents in the data packet provided by the application have no correspondence with the slots, the intent in the user's first voice input can still be identified first according to the intent information in the comprehensive semantic library; since there is no correspondence between intents and slots, the slot value in the user's first voice input is then identified according to all the slot information in the data packet of the target application determined by the intent.
  • in this case, the speech understanding system cannot directly determine whether the user's voice input contains all the required slot values for the service corresponding to the target application's intent.
  • the speech understanding system can transmit the recognized intent and slot information in the user's speech input to the target application, and the target application determines whether the user's speech input contains all the required slots for the service corresponding to the intent. If it does, the target application executes the corresponding service; if it does not, the missing slot information is sent back to the speech understanding system, and steps 150, 160, and 170 are executed.
  • the operation of identifying the intent of the first speech input according to the intent information in the comprehensive semantic library, and identifying the slot value in the first speech input according to the slot information in the comprehensive semantic library, is specifically performed by the semantic understanding model. That is, the text content corresponding to the user's voice input is input into the semantic understanding model together with the intent information and slot information in the comprehensive semantic library, and the semantic understanding model identifies the intent and slot value in the user's voice input, as shown in FIG. 6.
  • existing machine reading comprehension (MRC, Machine Reading Comprehension) models, such as the bidirectional attention neural network model BERT (Bidirectional Encoder Representations from Transformers), can be used as the semantic understanding model; other models that can implement the semantic understanding function are also possible.
  • in one implementation, the intent and the slot value can be identified one by one; that is, only one piece of intent information, together with the text corresponding to the user's voice input, is input into the semantic understanding system at a time, and the semantic understanding system determines whether the corresponding intent is contained in the user's voice input. Taking identifying whether the user's voice input contains the intent of "booking a flight" as an example, the description information of the "booking a flight" intent from the data packet provided by the application and the text corresponding to the user's voice input are jointly input into the semantic understanding system; if the user's voice input contains the intent, the output of the model can be "Yes", and otherwise the output can be "No".
  • similarly, the semantic understanding system determines whether the user's voice input contains the corresponding slot value. Taking identifying whether the user's voice input contains the slot value of the "destination" slot as an example, the description information of the "destination" slot from the data packet provided by the application and the text corresponding to the user's voice input are jointly input into the semantic understanding system; if the user's voice input contains the "destination" slot value "Beijing", the output of the model can be "Beijing", and if the user's voice input does not contain a "destination" slot value, the output of the model can be "does not contain the corresponding slot value".
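  • The patent feeds the description information and the utterance text jointly into a BERT-style MRC model. As a stand-in sketch, off-the-shelf Hugging Face pipelines can play the two roles described above; the choice of pipelines, the default checkpoints, and the 0.5 threshold are assumptions, not the patent's model.

```python
from transformers import pipeline

# Off-the-shelf stand-ins for the patent's BERT-style MRC model (assumption):
intent_matcher = pipeline("zero-shot-classification")  # intent yes/no matching
slot_reader = pipeline("question-answering")           # extractive slot filling

utterance = "Book me a flight ticket to Beijing tomorrow"

# Intent recognition: pair the utterance with one intent description at a time.
intent_desc = "book air tickets"
res = intent_matcher(utterance, candidate_labels=[intent_desc], multi_label=True)
contains_intent = res["scores"][0] > 0.5  # "Yes"/"No" via a threshold (assumption)

# Slot filling: the slot's description information plays the role of the MRC question.
if contains_intent:
    ans = slot_reader(question="arrival place / destination", context=utterance)
    slot_value = ans["answer"]            # expected: "Beijing"
```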
  • preferably, the intent in the user's voice input is recognized first, the target application is determined according to the recognized intent, and the slot value in the user's voice input is then identified according to the slot information in the target application's data packet; this avoids identifying slot values against the slot information in other applications' data packets, avoids unnecessary operations, saves computing time, and improves the efficiency of speech understanding.
  • the training process of the semantic understanding model includes the following steps:
  • Step 710 Collect training data.
  • user instructions related to the execution of specific services by various applications may be widely collected.
  • the user instruction may cover various existing applications or applications that may interface with the semantic understanding system in the future as much as possible, so that the semantic understanding system can adapt to as many applications as possible.
  • new relevant user instructions can be further collected, and the semantic understanding model can be further trained and upgraded with them, so as to adapt to newly connected business applications.
  • the training text can also cover various expressions of each user instruction as much as possible, so as to improve the generalization ability of the semantic understanding model, so that the semantic understanding model can recognize multiple expressions of the user.
  • Step 720: Label the training data.
  • In one implementation, the massive collection of user instructions can be annotated, labeling the intent and the slot values in each user instruction.
  • different description modes may be used to describe the intent in the user instruction and the slot corresponding to the slot value, so as to cope with different description modes for the intent and the slot in the data packets of different applications.
  • Step 730: Train the semantic understanding model with the labeled training data.
  • In one implementation, for each user instruction, the description information of an intent and the user instruction are combined into a training sample as the input of the semantic understanding model, and the intent output by the model together with the annotated intent are used as the input of the loss function to compute the loss value and update the model. For example, the intent description "book a flight ticket" and the intent-annotated user instruction "book a plane ticket from Shanghai to Beijing" are input into the semantic understanding model, in the hope of obtaining a matching result.
  • In another implementation, for each user instruction, the description information of a slot and the user instruction are combined into a training sample as the input of the semantic understanding model, and the slot value output by the model together with the annotated slot value are used as the input of the loss function to compute the loss value and update the model. For example, the description "flight destination" of the slot "destination" and the slot-annotated user instruction "book a high-speed rail ticket from Shanghai to Beijing" are input into the model, in the hope of obtaining the result "Beijing". A training-step sketch follows.
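A minimal training-step sketch under stated assumptions: the encoder's exact signature and the use of cross-entropy are illustrative, since the embodiment only requires that the model output and the annotation feed a loss function whose value updates the model.

```python
# One optimization step for a (description, instruction, annotation) sample.
# `model(description, instruction)` returning logits is an assumed interface.
import torch.nn.functional as F

def training_step(model, optimizer, description, instruction, label_ids):
    logits = model(description, instruction)   # model output
    loss = F.cross_entropy(logits, label_ids)  # compare with the annotation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Intent sample: ("book a flight ticket",
#                 "book a plane ticket from Shanghai to Beijing", match-label)
# Slot sample:   ("flight destination",
#                 "book a high-speed rail ticket from Shanghai to Beijing",
#                 span-label of "Beijing")
```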
  • a trained semantic understanding model can accurately identify intent and slot values in user instructions that are the same or similar to the training data.
  • the trained semantic understanding model also has strong generalization ability, which can identify the intent and slot value in different user command expressions.
  • a trained semantic understanding model is also able to identify new intents or slots.
  • For example, suppose the training data of the current semantic understanding model contains no training data related to "booking a high-speed rail ticket", but does contain user instructions and intent description information for the similar intent "booking a flight ticket". In this case, the model has already acquired, during training, the ability to identify the "book a flight ticket" intent and the related slot values from user instructions. When the data packet of a newly accessed application contains such intent information and the related slot information, the model can, with a high probability, successfully identify the intent and the related slot values from the user's instruction, thereby achieving adaptation to a new application not covered by the training data.
  • FIG. 8 illustrates an example implementation of an example method according to an embodiment of the present invention, wherein the target application is a ticket booking application, and the executed service is an air ticket booking service.
  • After the booking application accesses the voice understanding system, step 811 is performed: an interface registration request is sent to the voice understanding system, containing the identifier of the booking application, the intent information corresponding to the service to be executed, and the slot information corresponding to the slot values required for executing the service.
  • For example, in this example implementation, the intent information is "book a flight ticket", and in the data packet in the registration request sent by the booking application, one intent corresponds to one slot package; that is, the slot package corresponding to the "book a flight ticket" intent contains the mandatory slot information "destination" and "departure time", and the optional slot information "cabin type" and "seat preference". One possible shape of such a packet is sketched below.
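One possible shape of that registration data packet, purely as an assumption, since the patent fixes no serialization format:

```python
# Illustrative registration request in which one intent corresponds to one
# slot package containing mandatory and optional slot information (FIG. 8).
registration_request = {
    "app_id": "booking_app",
    "intents": [{
        "description": "book a flight ticket",
        "slot_package": {
            "mandatory": ["destination", "departure time"],
            "optional": ["cabin type", "seat preference"],
        },
    }],
}
```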
  • After the speech understanding system receives the registration request of the booking application, it proceeds to step 821, stores the intent information and slot information from the data packet of the registration request in the comprehensive semantic library, and proceeds to step 822 to give feedback to the booking application by sending a registration success message. At this point, the booking application has completed the adaptation process with the speech understanding system.
  • When the user needs the booking application to execute the flight booking service, step 831 is performed, and a natural language instruction is sent to the speech understanding system.
  • the instruction sent by the user is: "book a direct ticket to Shanghai for me”.
  • After acquiring the user's voice input, the voice understanding system proceeds to step 823 to identify the intent and slot values in the user's voice input according to the intent information and slot information stored in the comprehensive semantic library. For example, in this example implementation, the semantic understanding system recognizes that the user's voice input contains the intent "book a flight ticket", that the slot value corresponding to the selection-type slot "direct or not" is "yes", and that the slot value corresponding to the fill-in slot "destination" is "Shanghai".
  • After recognizing the intent and slot values in the user's voice input, the system proceeds to step 824 and, based on the "book a flight ticket" intent combined with the slot values and indicators such as the user's frequency of use and user ratings, determines the booking application as the target application.
  • After the target application is determined, the system proceeds to step 825 and, according to the slot information in the slot package corresponding to the "book a flight ticket" intent in the data packet of the booking application, determines whether the user's voice input contains the slot values corresponding to all mandatory slots, that is, whether the slot values corresponding to "destination" and "departure time" are included. If the user's voice input contains the slot values corresponding to "destination" and "departure time", the process proceeds to step 827. Taking the user instruction "book a direct ticket to Shanghai for me" as an example, the slot value corresponding to "departure time" is missing; in this case, step 826 is executed, requiring the user to input the slot value corresponding to "departure time" again. A sketch of this completeness check follows.
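A sketch of the step-825 completeness check, assuming the packet shape sketched above; names and values are illustrative.

```python
# Compare identified slot values against the mandatory slots of the intent's
# slot package; anything missing triggers the re-prompt of step 826.
slot_package = {"mandatory": ["destination", "departure time"],
                "optional": ["cabin type", "seat preference"]}

def missing_mandatory_slots(package, identified_values):
    return [s for s in package["mandatory"] if s not in identified_values]

identified = {"destination": "Shanghai", "direct or not": "yes"}
missing = missing_mandatory_slots(slot_package, identified)
if missing:  # -> ["departure time"]
    print("Please provide: " + ", ".join(missing))  # step 826
else:
    pass  # step 827: transmit the intent and slot values to the application
```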
  • After receiving the feedback from the speech understanding system, the user proceeds to step 832 and inputs a description of the date such as "tomorrow" or "the day after tomorrow", or directly inputs a specific date such as "November 20, 2020", as a supplement for the missing slot value.
  • the user can also re-enter the complete command by voice, such as "help me book a direct flight to Shanghai tomorrow".
  • After the intent and the slot values corresponding to all mandatory slots have been obtained, step 827 is performed: the voice understanding system transmits the intent and slot values in the user's voice input to the booking application.
  • the speech understanding system sends the intent of "book a flight” and the slot values of "direct” and "Shanghai" to the booking application.
  • After receiving the intent and slot values sent by the speech understanding system, the booking application proceeds to step 812 and executes the flight booking service according to the received intent and slot values. After executing the corresponding service, the booking application feeds back the execution status to the semantic understanding system through step 813, and the semantic understanding system further feeds back the execution status to the user through step 828.
  • In this example implementation, if the flight ticket reservation succeeds, the successful result is fed back; otherwise, the reason for the failure of the reservation is fed back.
  • FIG. 9 illustrates another exemplary implementation of the exemplary method according to an embodiment of the present invention, wherein the target application is a ticket booking application, and the executed service is an air ticket booking service.
  • Most of the content of this example implementation is the same as that of FIG. 8; the difference is that, in the data packet in the registration request sent by the booking application to the speech understanding system, there is no correspondence between intents and slots. Therefore, the speech understanding system cannot obtain from the data packet the mandatory slots required by the booking application to execute the service corresponding to the "book a flight ticket" intent.
  • After the target application is determined, step 925 is performed, and the speech understanding system transmits the intent and slot values in the user's voice input to the booking application.
  • In step 912, the booking application determines whether the user's voice input includes the slot values corresponding to all mandatory slots, that is, whether the slot values corresponding to "destination" and "departure time" are included. If so, the process proceeds to step 914. Taking the user instruction "book a direct ticket to Shanghai for me" as an example, the slot value corresponding to "departure time" is missing; in this case, step 913 is executed, and the "departure time" slot is fed back to the speech understanding system.
  • The speech understanding system then further interacts with the user through steps 926 and 932, requiring the user to input the slot value corresponding to "departure time" again, and transmits the missing slot value to the booking application, as in the sketch below.
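A sketch of this FIG. 9 variant, in which the application itself validates the mandatory slots; function and field names are illustrative assumptions.

```python
# Application-side validation (step 912): report missing mandatory slots
# back to the speech understanding system (step 913) or execute (step 914).
def app_handle(intent, slot_values,
               mandatory=("destination", "departure time")):
    missing = [s for s in mandatory if s not in slot_values]
    if missing:
        return {"status": "missing_slots", "slots": missing}  # step 913
    return {"status": "ok"}                                   # step 914

reply = app_handle("book a flight ticket", {"destination": "Shanghai"})
# -> {'status': 'missing_slots', 'slots': ['departure time']}; the system
# then re-prompts the user (steps 926/932) and forwards the supplement.
```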
  • the subsequent operation steps are the same as those in Figure 8 .
  • the above example embodiment only takes the booking software as an example to show the workflow of adapting the speech understanding system to the application and performing speech understanding of the user's speech input. It should be understood that other commonly used applications are also possible.
  • a semantic understanding system 1000 is provided, and the semantic understanding system includes the following modules:
  • the communication module 1010 is configured to acquire the user's first voice input, that is, to implement the content of step 110 .
  • the communication module 1010 is further configured to receive data packets in the application registration request.
  • the data storage module 1020 is configured to store a data packet of the application, where the data packet is derived from a registration request of the application, and the data packet includes the intent information and slot information of the application.
  • After receiving the registration request of a newly accessed application, the speech understanding system saves the data packet in the registration request in the comprehensive semantic library for subsequent use in semantic recognition. The comprehensive semantic library contains the data packets of one or more applications, where applications and data packets can be in a one-to-one correspondence; that is, each application corresponds to a data packet containing the intent information and slot information required by that application to execute its services. The comprehensive semantic library is continuously updated: every time a new application is accessed, the data packet corresponding to the newly accessed application is saved into the library, i.e., the library is updated once, so that it always contains the data packets of all accessed applications. A minimal sketch of this behaviour follows.
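A minimal sketch of the data storage module's registration behaviour under the assumptions above (one packet per application, refreshed on every registration); class and key names are illustrative.

```python
# The comprehensive semantic library always holds the packets of all
# registered applications; each new registration adds or refreshes one entry.
class SemanticLibrary:
    def __init__(self):
        self.packets = {}  # app_id -> data packet (one-to-one mapping)

    def register(self, app_id, packet):
        self.packets[app_id] = packet    # one update per registration
        return {"status": "registered"}  # cf. the step-822 feedback
```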
  • The processing module 1030 is configured to identify the intent of the user's first voice input according to the intent information in the comprehensive semantic library, and to identify the slot values in the user's first voice input according to the slot information in the comprehensive semantic library, that is, to implement the content of step 120.
  • In one implementation, the operations of identifying the intent in the user's first voice input according to the intent information in the comprehensive semantic library, and of identifying the slot values in the user's first voice input according to the slot information in the comprehensive semantic library, can be executed by the semantic understanding model; the specific implementation is shown in FIG. 6.
  • the working process of the semantic understanding model has been described in detail in the above embodiment, and to avoid repetition, it will not be repeated here.
  • the processing module 1030 is further configured to, according to the intent of the first voice input, determine the target application of the first voice input.
  • a speech understanding system may simultaneously access multiple applications that perform the same or similar service functions, and in this case, the same intent may correspond to multiple target applications.
  • the intent of "booking a flight ticket" may correspond to multiple target applications such as Ctrip, Fliggy, and Meituan at the same time.
  • In one implementation, the multiple target applications may be sorted according to the user's frequency of use; in another implementation, they may be sorted according to user ratings; in yet another implementation, they may be ranked by considering both the user's usage frequency and user ratings. It should be understood that other possible ways of ordering the target applications also exist. After the sorting, the top-ranked application can be selected as the target application corresponding to the user's first voice command, and the intent and slot values in the user's voice input are sent to that target application, which then executes the corresponding service.
  • the slot value included in the user's voice input may also be taken into consideration, and the target application is determined on the basis of comprehensively considering the intent and slot value in the user's voice input.
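A sketch of ranking candidate target applications; the linear weighting below is purely an assumption, since the embodiment only says that usage frequency and ratings may be considered, separately or together.

```python
# Rank candidate applications sharing the "book a flight ticket" intent by a
# weighted mix of usage frequency and user rating (weights are assumptions).
candidates = [
    {"app": "Ctrip",   "frequency": 12, "rating": 4.6},
    {"app": "Fliggy",  "frequency": 3,  "rating": 4.8},
    {"app": "Meituan", "frequency": 7,  "rating": 4.2},
]
best = max(candidates, key=lambda c: 0.7 * c["frequency"] + 0.3 * c["rating"])
# The top-ranked app receives the intent and slot values and runs the service.
print(best["app"])
```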
  • In one implementation, if the intent information in the user's voice input cannot be correctly recognized, feedback of speech understanding failure is output.
  • the feedback can be in any possible manner, for example, a warning of speech comprehension failure can be issued by means of screen display or voice broadcast, or the user can be required to re-enter voice input by means of screen display or voice broadcast. It should be understood that other feedback methods that can make the user aware of the speech recognition failure are also possible.
  • the communication module 1010 is further configured to transmit the intent and the slot value in the first voice input to the target application.
  • In one implementation, the speech understanding system and the business application may share a central processing unit (CPU); for example, the speech understanding system and the matched application are both on one smart terminal (for example, the Siri voice assistant), and in this case the transfer of the intent and slot values can be achieved through program calls.
  • the speech understanding system can be arranged on the cloud side, and the application can be arranged on the terminal side. In this case, the intent and slot value can be transmitted through network communication.
  • In another implementation, the functions of the speech understanding system and of the business application may be executed by different processors; for example, the speech understanding system may be a separate device used to control other peripheral devices (for example, the Xiaodu artificial intelligence assistant), and in this case the transmission of the intent and slot values can be realized through network communication.
  • other arrangements of speech understanding systems and applications are also possible, and correspondingly, other implementations capable of transmitting the intent and slot value in the user's speech command to the corresponding target application are also possible.
  • the communication module 1010 is further used for:
  • Step 150 is executed, and the user is required to input the missing slot value.
  • Step 160 is executed to obtain the second voice input.
  • Step 190 is executed, and the execution result of the target application executing the corresponding service is fed back to the user.
  • Step 170 is executed to transmit the slot value in the second voice input to the target application.
  • Step 180 is executed to receive feedback that the target application executes the corresponding service.
  • steps 150 to 190 have been described in detail in the method embodiments, and in order to avoid repetition, details are not repeated here.
  • In one implementation, the speech understanding system and the business applications can share a central processing unit (CPU); for example, the speech understanding system and the matched applications are installed on one smart terminal (for example, the Siri voice assistant), and the functions of the various applications on the smart terminal device can be controlled through the speech understanding system, such as playing music, answering or hanging up calls, and checking the weather.
  • the smart terminals here can be desktop computers, televisions, tablet computers, laptop computers, smart phones, e-readers, smart watches, smart glasses, and the like.
  • the speech understanding system may be arranged on the cloud side, and the speech understanding system on the cloud side may help one or more end-side devices to perform the function of speech understanding.
  • the speech understanding system is arranged in the cloud, the speech understanding system in the cloud mainly performs the function of speech understanding, and the device on the terminal side is mainly used for interacting with the target application and interacting with the user.
  • the terminal-side device sends the intent information and slot information received from the registered application to the cloud-side device, and the cloud-side device stores the intent information and slot information received from one or more terminal-side devices.
  • When the terminal-side device receives voice information, it can send the voice information to the cloud-side device; the cloud-side device identifies the intent of the voice information and the slot values in the voice information according to the stored intent information and slot information, and sends the identified intent and slot values back to the terminal-side device, which interacts with the target application to perform the subsequent operations.
  • When some slot information is missing from the voice information, the terminal-side device requests the user to input the missing slot information, and sends the voice information re-entered by the user to the cloud-side device for voice understanding, as in the sketch below.
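A sketch of that cloud/terminal division of labour; the message shapes and method names are illustrative assumptions, not an interface defined by the patent.

```python
# Terminal-side loop: the cloud does the understanding, the terminal talks
# to the user and to the target application.
def terminal_on_voice(audio, cloud, user_io, target_app):
    reply = cloud.understand(audio)                       # intent + slots
    while reply.get("missing_slots"):
        supplement = user_io.ask(reply["missing_slots"])  # re-prompt user
        reply = cloud.understand(supplement)              # re-run understanding
    target_app.execute(reply["intent"], reply["slots"])
```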
  • In another implementation, the speech understanding system can be a separate control device used to control other peripheral devices (for example, the Xiaodu artificial intelligence assistant), and the functions of the peripheral devices can be controlled through the speech understanding device.
  • the speech understanding device can be located indoors to control various household appliances in the room; the speech understanding system can be located in the car to control various hardware systems in the cockpit, etc. It should be understood that other possible implementations of speech understanding systems are also possible.
  • a semantic understanding apparatus 1100 is provided, and the semantic understanding apparatus includes the following modules:
  • the input and output device 1110 is used for receiving the first voice input of the user.
  • the input and output device 1110 may include a voice input device such as a microphone, and is used to receive the user's first voice input to implement the content of step 110 .
  • the processor 1120 is configured to receive the data packet in the registration request of the application.
  • the memory 1130 is used to store a data packet of the application, the data packet is derived from a registration request of the application, and the data packet includes the intention information and slot information of the application.
  • After receiving the registration request of a newly accessed application, the speech understanding system saves the data packet in the registration request in the comprehensive semantic library for subsequent use in semantic recognition. The comprehensive semantic library contains the data packets of one or more applications, where applications and data packets can be in a one-to-one correspondence; that is, each application corresponds to a data packet containing the intent information and slot information required by that application to execute its services. The comprehensive semantic library is continuously updated: every time a new application is accessed, the data packet corresponding to the newly accessed application is saved into the library, i.e., the library is updated once, so that it always contains the data packets of all accessed applications.
  • The processor 1120 is further configured to identify the intent of the user's first voice input according to the intent information in the comprehensive semantic library, and to identify the slot values in the user's first voice input according to the slot information in the comprehensive semantic library; the specific implementation is shown in FIG. 6.
  • the working process of the semantic understanding model has been described in detail in the above-mentioned embodiment, and to avoid repetition, it will not be repeated here.
  • identifying the intent information in the user's first voice input according to the intent information in the integrated voice database, and identifying the slot value in the user's first voice input according to the slot information in the integrated voice database can be performed by a semantic understanding model.
  • the processor 1120 is further configured to, according to the intention of the user's first voice input, determine the target application of the user's first voice input;
  • a speech understanding system may simultaneously access multiple applications that perform the same or similar service functions, and in this case, the same intent may correspond to multiple target applications.
  • the intent of "booking a flight ticket" may correspond to multiple target applications such as Ctrip, Fliggy, and Meituan at the same time.
  • In one implementation, the multiple target applications may be sorted according to the user's frequency of use; in another implementation, they may be sorted according to user ratings; in yet another implementation, they may be ranked by considering both the user's usage frequency and user ratings. It should be understood that other possible ways of ordering the target applications also exist.
  • the target application ranked first can be selected as the target application corresponding to the user's first voice command, and the intent and slot value in the user's voice input are sent to the target application, The corresponding service is executed by the target application.
  • the slot value included in the user's voice input may also be taken into consideration, and the target application is determined on the basis of comprehensively considering the intention and slot value in the user's voice input.
  • In one implementation, if the intent information in the user's voice input cannot be correctly recognized, feedback of speech understanding failure is output.
  • the feedback can be in any possible manner, for example, a warning of speech comprehension failure can be issued by means of screen display or voice broadcast, or the user can be required to re-enter voice input by means of screen display or voice broadcast. It should be understood that other feedback methods that can make the user aware of the speech recognition failure are also possible.
  • the processor 1120 is further configured to send the intent and the slot value in the user's first voice input to the corresponding target application.
  • the input and output device 1110 is also used for:
  • Step 160 is executed to obtain the second voice input.
  • Step 150 is executed, and the user is required to input the missing slot value.
  • Step 190 is executed, and the execution result of the target application executing the corresponding service is fed back to the user.
  • the human-computer interaction interface for performing steps 160 and 190 may include a voice output device such as a speaker.
  • The voice understanding system first generates a text instruction and then uses a text-to-speech (TTS) system to convert the text instruction into speech and broadcast it to the user, as in the sketch below.
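A minimal sketch of that feedback path; pyttsx3 is used purely as an illustration, since the patent names no TTS implementation, and the prompt text is an example.

```python
# Compose a text instruction, then read it out with a TTS engine.
import pyttsx3

def speak(text):
    engine = pyttsx3.init()
    engine.say(text)
    engine.runAndWait()

speak("Please provide the departure time for your flight.")  # e.g. step 150
```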
  • Processor 1120 is also used to:
  • Step 180 is executed to receive feedback that the target application executes the corresponding service.
  • Step 170 is executed to transmit the slot value in the second voice input to the target application.
  • steps 150 to 190 have been described in detail in the method embodiments, and in order to avoid repetition, details are not repeated here.
  • In one implementation, the voice understanding device can be a separate control device used to control other peripheral devices (for example, the Xiaodu artificial intelligence assistant), and the functions of the peripheral devices can be controlled through the voice understanding device.
  • the speech understanding device can be located indoors to control various household appliances in the room; the speech understanding system can be located in the car to control various hardware systems in the cockpit, etc. It should be understood that other possible implementations of speech understanding systems are also possible.
  • A speech understanding device 1200 includes a memory 1210 and a processor 1220 (the number of processors 1220 in the speech understanding device 1200 may be one or more; one processor is taken as an example in FIG. 12).
  • Memory 1210 may include read-only memory and random access memory, and provides instructions and data to processor 1220 .
  • a portion of memory 1210 may also include non-volatile random access memory (NVRAM).
  • The memory 1210 stores operation instructions, executable modules, or data structures, or a subset thereof, or an extended set thereof, where the operation instructions may include various operation instructions for implementing various operations.
  • the methods disclosed in the above embodiments of the present application may be applied to the processor 1220 or implemented by the processor 1220 .
  • the processor 1220 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method may be completed by an integrated logic circuit of hardware in the processor 1220 or an instruction in the form of software.
  • The above-mentioned processor 1220 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The processor 1220 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in conjunction with the embodiments of the present application may be directly embodied as executed by a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory 1210, and the processor 1220 reads the information in the memory 1210, and completes the steps of the above method in combination with its hardware.
  • Embodiments of the present invention also provide a computer-readable storage medium. From the description of the above implementations, those skilled in the art can clearly understand that the present application can be implemented by means of software plus the necessary general-purpose hardware, and can of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, any function completed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, a software implementation is in most cases the better implementation. Based on this understanding, the technical solutions of the present application, in essence or in the parts contributing to the prior art, can be embodied in the form of a software product.
  • The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) to execute the methods described in the various embodiments of the present application.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or a wireless manner (e.g., infrared, radio, microwave).
  • The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device, such as a server or a data center, integrating one or more available media.
  • The usable media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)), and the like.
  • the disclosed methods, apparatuses, devices, computer storage media and computer program products may be implemented in other manners.
  • the embodiments of the apparatus described above are only illustrative, and the division of the modules is only a logical function division, and there may be other division manners during specific implementation.
  • multiple modules may be combined or may be integrated into another system, or some features may be omitted, or not implemented.
  • In one implementation, the apparatus is stored in a memory in the form of executable program modules and is called and executed by a processor, so that each module in the semantic understanding apparatus is controlled by the processor to execute the corresponding operations, thereby realizing the interface matching between the semantic understanding system and new business applications.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Telephonic Communication Services (AREA)
  • Telephone Function (AREA)

Abstract

A speech understanding method (100), relating to the field of natural language understanding, which can be used in a speech understanding device (1200) arranged on the cloud side or the terminal side to understand a user's voice information and identify the target application corresponding to the user's intent in the voice information. The speech understanding method (100) comprises: acquiring the user's voice information (110); identifying the intent of the voice information and the slot values in the voice information according to intent information and slot information stored in a comprehensive semantic library and originating from a plurality of registered applications (120); matching the application corresponding to the voice information according to the intent of the voice information (130); and transmitting the intent of the voice information and the slot values in the voice information to the corresponding application (140). The speech understanding method (100) can support users' flexible expressions and can adapt to all registered applications, and when adapting to a new application there is no need to re-collect data, annotate data, or retrain the model, giving the advantages of a short adaptation period and low cost.

Description

一种语音理解方法及装置 技术领域
本发明涉及自然语言理解技术领域,特别涉及一种语音理解方法。
背景技术
语音理解系统已经广泛应用在各种场景中,例如:智能手机、智能家居、智能座舱等,语音理解系统可以通过理解用户的语音输入,识别用户意图,使用户可以通过语音指令控制相关功能。一个语音理解系统通常会接入多个应用,以使用户可以通过一个语音理解系统实现对多个应用的多种功能的控制。
现有技术的一种方法中,由新接入的应用自定义语义表达模板,语音理解系统构建专用于该语义表达模板的语义理解模型,实现语音理解系统与该应用的适配。这种方案下的语义表达模板泛化能力很弱,难以支持用户的灵活表达。另一种方法中,需要预先收集新接入的应用对应的用户指令数据,利用标注后的数据对语义理解模型进行重新训练,实现语音理解系统与该应用的适配。这种方案下,每次适配新接入的应用时,均需要重新收集数据、标注数据、训练模型,适配周期长,人力、物力成本高。
发明内容
针对于现有技术中存在的问题,本发明实施例提供一种语音理解方法、设备、装置、计算机存储介质及计算机程序产品,该语音理解方法能够支持用户的各种灵活表达,具有更强的语音理解能力,能够适配所有注册应用,且在适配新的应用时无需重新采集数据、标注数据和训练模型,具有适配周期短,成本低的优点。
本发明实施例中的“意图”是指针对数据或资源执行的操作,可采用动宾短语来命名,例如,“预定机票”“查询天气”“播放音乐”等均是意图的表达。一个“意图”可能对应于一个或多个“槽位”例如,“预定机票”的意图对应于“出发时间”、“出发地”、“目的地”“机舱种类”等槽位。“槽位”用于存放数据或资源的属性,例如,机票的“出发时间”、“出发地”、“目的地”“机舱种类”等。“槽位值”是指“槽位”中存放的数据或资源的具体属性,例如,机票的“目的地”为“北京”,机票的“出发时间”为“明天”等。
本发明实施例第一方面提供一种语音理解方法,包括:
获取第一语音信息;
根据综合语义库中的意图信息,匹配第一语音信息的意图;
根据第一语音信息的意图,确定目标应用;
根据综合语义库中的槽位信息,识别第一语音信息中的槽位值;
综合语义库中的意图信息和槽位信息来源于多个注册应用;
将第一语音信息的意图和第一语音信息中的槽位值传送给目标应用。
其中,注册应用为与语音理解系统进行适配的应用,例如,注册应用可以为安装在终端设备中,与语音理解系统进行适配的应用,注册应用在语音理解系统中进行注册时,将本应用执行业务对应的意图信息和需要的槽位信息发送给语音理解系统,语音理解系统将 接收到的意图信息和槽位信息储存在综合语义库中。语义理解系统可以直接将接收到的信息逐个保存综合语义库中,即,综合语义库中可以包含对应于每一个注册应用的数据包,每个数据包中包括该应用执行对应的意图信息和需要的槽位信息。除此之外,语义理解系统也可以在接收到应用发送的信息后对信息进行一个整合,合并相同的意图信息和槽位信息,此时,综合语义库中的意图信息和槽位信息不是按照各个应用来存储的,但存在一个查询表,表明应用和其意图信息、槽位信息的对应关系。
在本发明实施例中,用户通常通过语音输入表达一个或多个意图,该意图为指示目标应用执行与该意图对应的业务。本发明实施例支持符合用户习惯的任何自然语言表达,例如,当用户想要表达“预定机票”的意图时,支持用户采用诸如“帮我预定一张去北京的机票”等较为规范、格式化的表达方式,也支持用户采用诸如“我要去北京”等较为简易、信息量较少的表达方式,还支持用户采用诸如“预定机票”“订票”等关键词式的表达方式。当然,本发明实施例还包括其他方式的表达。
应用在接入语音理解系统时,可以向语音理解系统发送注册请求,该注册请求中包含该应用的标识以及数据包,该数据包中包含该应用执行业务所需的意图信息和槽位信息。数据包中的意图信息和槽位信息,具体指意图的描述信息和槽位的描述信息。其中,意图的描述信息支持应用灵活的描述方式,例如,对于“预定机票”的意图,意图的描述信息可以是诸如“预定XX日期去往XX城市的机票”等较为格式化的描述方式,也可以是诸如“预定机票”等关键词式的描述方式。槽位的描述信息也支持应用灵活的描述方式,例如,对于“目的地”的槽位,槽位的描述信息可以是诸如“一个城市的名称”等描述方式,也可以是诸如“到达地”“目的地”等关键词式的描述方式。
应用执行特定业务所需的槽位信息,包括选择型槽位和填写型槽位,其中,选择型槽位是指槽位的取值是预先定义好的有限集合的槽位,填写型槽位是指槽位的取值不是预先定义好的槽位,其槽位值可能直接来自用户指令中的片段(抽取式),也可能由语音理解系统根据用户指令生成(生成式)。例如:选择型槽位可以为“判断机票是否为直达机票”,该选择型槽位对应的槽位值可以包括“是”和“否”两个可能的槽位值。填写型槽位可以为“机票的目的地”,该填写型槽位为抽取式的填写型槽位,槽位值可以为用户的语音输入中包含的“北京”、“上海”等地点名称。
注册请求中的数据包中的意图信息和槽位信息可储存在综合语义库中,具体的,综合语义库中包含一个或多个应用的数据包,其中应用与数据包可以为一一对应的关系,即一个应用对应的数据包中包含该应用执行业务所需的意图信息和槽位信息。综合语义库处于一个不断更新的状态,每接入一个新的应用,都会将该新接入的应用对应的数据包保存在综合语义库中,即对综合语义库进行一次更新,使得综合语义库始终包含所有注册应用的数据包。
在获取到第一语音输入后,可通过语音识别将语音输入转化为文字内容,方便后续处理。在对用户的语音输入进行理解时,将用户语音输入对应的文字内容与综合语义库中的意图信息和槽位信息共同输入语义理解模型中,由语义理解模型实现根据综合语义库中的意图信息,识别所述第一语音输入的意图,根据综合语义库中的槽位信息,识别所述第一语音输入中的槽位值。
在识别出用户的语音指令中的意图和槽位值后,将用户的语音指令中的意图和槽位值传送给对应的目标应用。这里,根据语音理解系统与注册应用不同的布置方式,“传送”可以通过多种方式实现。当语音理解系统与业务应用共用一个中央处理器(CPU,Central Processing Unit)时,可通过程序调用实现意图和槽位值的传送。当语音理解系统与业务应用的功能分别由不同的处理器执行时,例如,当语音理解系统布置在云端,或语音理解系统布置在一个独立的装置上时,可通过网络通信实现意图和槽位值的传送。
本发明实施例中语义理解模型能够从各种方式表达的语音输入中识别出意图和槽位值,因此,本发明实施例提供的语音理解方法能够支持用户各种灵活的表达方式,具有泛化能力强的优点。另外,综合语义库中始终包含所有注册应用的意图信息和槽位信息,语义理解模型可以根据各个注册应用的意图信息和槽位信息判断用户的语音输入中是否包含对应的意图和槽位值,因此,本发明实施例提供的语音理解方法能够支持所有注册应用的语音控制功能,可以做到新应用的“即插即用”,缩短语音理解系统与新应用的适配周期,减小适配新应用的成本。
结合第一方面,在第一方面第一种可能的实现方式中,该语音理解方法还包括:
根据第一语音信息的意图和第一语音信息中的槽位值,显示目标应用的第一界面。
其中第一界面可以是目标应用的主界面,即,在确定第一语音信息对应的目标应用后,显示目标应用的主界面,供用户进行后续操作。第一界面也可以展示对应于第一语音信息中的意图,满足第一语音信息的条件的所有可选项,供用户选择。例如,当第一语音信息表达“预定机票”的意图时,可以在第一界面中显示所有符合第一语音信息中的槽位值的机票,供用户进行进一步的选择。
根据第一语音信息的意图和第一语音信息中的槽位值,显示目标应用的第一界面,可以将语音理解的结果展示给用户,为用户提供进一步操作或确定的可能,用户也可以通过第一界面确认语音理解的正确性,提高了语音理解的可视化程度和准确度。
结合第一方面或第一方面第一种可能的实现方式,在第一方面第二种可能的实现方式中,将第一语音信息的意图和第一语音信息中的槽位值传送给目标应用后,该语音理解方法还包括:
输出目标应用执行目标操作的结果,目标操作为由第一语音信息的意图和第一语音信息中的槽位值确定的操作。
其中,可以采取多种方式输出目标应用执行目标操作的结果,例如,可通过屏幕显示的方式,显示目标应用执行目标操作的结果;还可以通过声音提示的方式,规定一种声音反馈表示执行成功,而另一种声音反馈表示执行失败;还可以通过语音播报的方式进行反馈,直接通过语音播报执行结果。
语音理解系统通过获取目标应用执行对应业务的反馈,并向用户反馈对应业务的执行结果,能够让用户知道自己的语音指令是否被正确识别并执行,当执行失败时可以进行再次尝试或更换其他方式执行,避免由于用户误以为执行成功而实际并没有执行成功而因此不必要的麻烦。
结合第一方面或第一方面前两种任一种可能的实现方式,在第一方面第三种可能的实现方式中:
根据第一语音信息的意图,确定多个待选应用;
根据多个待选应用各自的用户使用频率或评分确定所述目标应用。
其中,可以通过多种方式来确定待选应用,例如,可以通过查表的方式,找到对应于第一语音信息的意图的应用,另外,还可以通过在各个应用的数据包中逐一进行查找的方式,找到对应于第一语音信息的意图的应用。评分可以是用户对于目标应用的评分,也可以是网络用户对于目标应用的评分。在根据用户的使用频率或评分确定目标应用时,可以仅考虑用户的使用频率,也可以只考虑用户的评分,也可以综合考虑使用频率和用户评分。
一个语音理解系统可能同时接入多个执行相同或相似业务功能的应用,此时,同一个意图可能对应于多个目标应用。例如:“预定机票”的意图可能同时对应于携程、飞猪、美团等多个目标应用。此时,可根据用户的使用频率或用户评分对多个目标应用进行排序,并且选定排序在第一位的目标应用作为用户的第一语音指令所对应的目标应用,将用户的语音输入中的意图和槽位值发送给该目标应用,由该目标应用执行相应业务。该方法根据用户的使用习惯或好评度选择目标应用,更加贴合用户的实际需求,能够极大的提升用户体验。
另外,也可根据用户的语音输入中的意图和槽位值共同确定目标应用,当然,也可以在用户的语音输入中的意图和槽位值的基础上,参考用户的使用频率和评分等信息共同确定目标应用。
结合第一方面或第一方面前三种任一种可能的实现方式,在第一方面第四种可能的实现方式中,该语音理解方法还包括:
接收目标应用的反馈信息,所述反馈信息中包括第一语音信息中缺少的槽位信息;
请求用户输入第一语音信息中缺少的槽位信息。
应用执行某一特定业务时,通常有一个或多个必须的槽位值,如果缺少一个或多个必须的槽位值,会导致目标应用无法正常执行特定业务。因此,本方法实施例接收目标应用关于缺少的槽位信息的反馈,要求用户继续输入缺少的槽位值。通过对缺少的槽位值进行二次补充,能够提高目标应用执行特定业务的成功率,提高用户语音指令执行的成功率。
结合第一方面或第一方面前四种任一种可能的实现方式,在第一方面第五种可能的实现方式中:
综合语义库中的一个意图信息对应至少一个槽位包,一个槽位包中包括一个或多个槽位信息。
根据综合语义库中的槽位信息,识别第一语音信息中的槽位值,具体为:
根据第一语音信息的意图对应的槽位包中的槽位信息,识别第一语音输入中的槽位值。
当综合语义库中的意图信息和槽位信息存在对应关系时,即一个意图对应于一个或多个槽位包时,在确定了第一语音信息中的意图后,只需要根据第一语音信息中的意图对应的槽位包中的槽位信息,识别第一语音信息中的槽位值,减少系统的工作量,提高语音识别的效率。
结合第一方面第五种可能的实现方式,在第一方面第六种可能的实现方式中:
一个槽位包中包含一个或多个必选槽位信息,必选槽位信息为执行意图信息对应的意图所必须的槽位信息;
当从第一语音信息中识别到的槽位值的数量少于目标槽位包中的必选槽位信息的数量,请求用户输入第一语音信息中缺少的槽位信息,目标槽位包为第一语音信息的意图对应的槽位包。
其中,必选槽位信息是指,应用执行特定业务所必须的槽位信息,以“预定机票”的业务为例,执行该业务所必须的槽位可以包括“出发时间”“出发地”“目的地”等,在缺少必选槽位中的任何一个槽位值的情况下,目标应用均无法正常执行特定业务。可选槽位值是指应用执行特定业务时的附加槽位信息,以“预定机票”的业务为例,执行该业务的可选槽位可以为“机票的舱位”“座位喜好”等,在缺少可选槽位中的槽位值的情况下,目标应用仍可以正常执行特定业务。
在应用提供的数据包中一个意图对应一个槽位包,且该槽位包中包含一个或多个必选槽位和一个或多个可选槽位的情况下,可由语音理解系统根据识别到的用户的第一语音输入中的意图和槽位值,判断用户的第一语音输入中是否包含了其意图对应的所有必须的槽位值。当用户的第一语音输入中包含其意图对应的所有必须的槽位值时,直接将用户的第一语音输入中的意图和槽位值传送给对应的目标应用。当用户的第一语音输入中未包含其意图对应的所有必须的槽位值时,如本发明实施例第一方面第一种实现方式和第二种实现方式所述,进行要求用户输入缺少的槽位值,以及将重新获取到的用户的语音输入中的槽位值传送给目标应用等后续操作。通过语音理解系统直接识别缺少的槽位信息,并要求用户补充缺少的槽位信息,并将完整的意图和槽位值一次性全部发送给对应的目标应用,能够减少目标应用与语音理解系统之间交互的次数,缩短语音理解的时间,提高语音理解的效率,极大的提升用户体验。
结合第一方面第四种可能的实现方式或第一方面第六种可能的实现方式,在第一方面第七种可能的实现方式中,该语音理解方法还包括:
响应于所述请求,获取第二语音信息;
根据所述缺少的槽位信息,识别所述第二语音信息中的槽位值;
将所述第二语音信息中的槽位值传送给所述目标应用。
应用执行某一特定业务时,通常有一个或多个必须的槽位值,如果缺少一个或多个必须的槽位值,会导致目标应用无法正常执行特定业务。因此,本方法实施例在用户的第一语音输入中缺少必须的槽位值时,要求用户继续输入缺少的槽位值,获取用户的第二语音输入,并将用户的第二语音输入中的槽位值传送给对应的目标应用,以满足目标应用执行特定业务所需的全部槽位值。通过对缺少的槽位值进行二次补充,能够提高目标应用执行特定业务的成功率,提高用户语音指令执行的成功率。另外,对于用户的第二语音输入,可仅表达缺少的必选槽位值,而不需要对用户的第一语音输入进行重复,避免了用户反复输入指令,能够很大程度上提升用户体验。
本发明实施例第二方面提供一种语音理解设备,包括:
通信模块,用于接收第一语音信息;
数据存储模块,用于储存意图信息和槽位信息,意图信息和槽位信息来源于多个注册应用;
处理模块,用于根据数据存储模块中的意图信息,匹配第一语音信息的意图,根据第 一语音信息的意图,确定目标应用,根据数据存储模块中的槽位信息,识别第一语音输入中的槽位值;
通信模块,还用于发送第一语音信息的意图和第一语音信息中的槽位值。
本发明实施例第二方面提供的语音理解设备可以时虚拟设备,可以布置在云端,该语音理解设备可以与多个端侧的设备进行通信,辅助多个端侧设备完成语音理解的任务。端侧设备可以接收用户的语音信息,和端侧设备中安装的注册应用传送的意图信息和槽位信息。云侧语音理解设备从多个端侧设备中收集意图信息和槽位信息,并存储在数据存储模块中。当端侧设备需要进行语音理解时,将语音信息发送给云侧设备,由云侧语音理解设备的处理模块进行语音理解的操作,并将语音信息的意图和语音信息中的槽位只发送回端侧设备,由端侧设备和目标应用进行交互,完成后续操作。
根据第二方面,在第二方面第一种可能的实现方式中,处理模块根据第一语音信息的意图,确定目标应用,具体为:
处理模块根据第一语音信息的意图,确定多个待选应用,并根据多个待选应用各自的用户使用频率或评分确定目标应用。
根据第二方面或第二方面第一种可能的实现方式,在第二方面第二种可能的实现方式中:
一个意图信息对应至少一个槽位包,一个槽位包中包括至少一个或多个所述槽位信息。
根据槽位信息,识别第一语音输入中的槽位值,具体为:
根据第一语音信息的意图对应的槽位包中的槽位信息,识别第一语音输入中的槽位值。
根据第二方面第二种可能的实现方式,在第二方面第三种可能的实现方式中,一个意图信息对应至少一个槽位包,一个槽位包中包括至少一个或多个槽位信息,具体为:
一个槽位包中包含一个或多个必选槽位信息,必选槽位信息为执行意图信息对应的意图所必须的槽位信息;
当从第一语音信息中识别到的槽位值的数量少于目标槽位包中的必选槽位信息的数量,通信模块还用于发送第一语音信息中缺少的槽位信息,目标槽位包为第一语音信息的意图对应的槽位包。
当第一语音信息中缺少必选槽位信息对应的槽位值时,语音理解系统将缺少的槽位信息发送给端侧设备,由端侧设备与用户进行交互,请求用户补充缺少的槽位信息。
根据第二方面第三种可能的实现方式,在第二方面第四种可能的实现方式中,通信模块将第一语音信息中缺少的槽位信息发送给第二设备后,通信模块,还用于获取第二语音信息;
处理模块,还用于根据缺少的槽位信息,识别第二语音信息中的槽位值;
通信模块,还用于发送第二语音信息中的槽位值。
本发明实施例第二方面提供的语音理解系统的各种可能的实现方式与第一方面相同,且能够达到上述所有的有益效果,为避免重复,此处不再进行赘述。
本发明实施例第三方面提供一种语音理解装置,包括:
麦克风,用于采集第一语音信息;
存储器,用于储存意图信息和槽位信息,意图信息和槽位信息来源于多个注册应用;
处理器,用于根据意图信息,匹配第一语音信息的意图,根据第一语音信息的意图,确定目标应用,根据槽位信息,识别第一语音信息中的槽位值;
所述处理器,还用于将第一语音信息的意图和第一语音信息中的槽位值传送给目标应用。
结合第三方面,在第三方面第一种可能的实现方式中:
语音理解装置还包括显示屏;
处理器用于根据第一语音信息的意图和第一语音信息中的槽位值,指示显示屏显示目标应用的第一界面。
结合第三方面或第三方面第一种可能的实现方式,在第三方面第二种可能的实现方式中:
语音理解装置还包括输出装置,用于在处理器将第一语音信息的意图和第一语音信息中的槽位值给目标应用后,根据处理器的指示输出目标应用执行目标操作的结果,目标操作为由第一语音信息的意图和第一语音信息中的槽位值确定的操作。
其中,输出装置可以是喇叭,显示屏等任一可以对用户输出反馈的装置。
结合第三方面或第三方面前两种可能的实现方式,在第三方面第三种可能的实现方式中,处理器根据第一语音信息的意图,确定目标应用,具体为:
处理器根据第一语音信息的意图,确定多个待选应用;
处理器根据多个待选应用各自的用户使用频率或评分确定目标应用。
结合第三方面或第三方面第一种或第三方面第三种可能的实现方式,在第三方面第四种可能的实现方式中:
处理器,还用于接收目标应用的反馈信息,反馈信息中包括第一语音信息中缺少的槽位信息;
语音理解装置还包括输出装置,用于输出第一请求,第一请求用于请求用户输入第一语音信息中缺少的槽位信息。
结合第三方面或第三方面前四种可能的实现方式,在第三方面第五种可能的实现方式中:
一个意图信息对应至少一个槽位包,一个槽位包中包括一个或多个槽位信息。
根据槽位信息,识别第一语音信息中的槽位值,具体为:
根据第一语音信息的意图对应的槽位包中的槽位信息,识别第一语音输入中的槽位值。
结合第三方面第五种可能的实现方式,在第三方面第六种可能的实现方式中:
一个意图信息对应至少一个槽位包,一个槽位包中包括一个或多个槽位信息,具体为:
一个槽位包中包含一个或多个必选槽位信息,必选槽位信息为执行意图信息对应的意图所必须的槽位信息;
语音理解装置还包括输出装置,当从第一语音信息中识别到的槽位值的数量少于目标槽位包中的必选槽位信息的数量,输出装置用于请求用户输入第一语音信息中缺少的槽位信息,目标槽位包为第一语音信息的意图对应的槽位包。
结合第三方面第四种或第三方面第六种可能的实现方式,在第三方面第七种可能的实现方式中:
请求用户输入第一语音信息中缺少的槽位信息之后,麦克风还用于采集第二语音信息;
处理器还用于根据所述缺少的槽位信息,识别第二语音信息中的槽位值;
处理器还用于将第二语音信息中的槽位值传送给目标应用。
本发明实施例第三方面提供的语音理解装置的各种可能的实现方式与第一方面相同,且能够达到上述所有的有益效果,为避免重复,此处不再进行赘述。
本发明实施例的第四方面提供一种计算机存储介质,包括计算机指令,当计算机指令在计算机上运行时,使得计算机执行如第一方面或第一方面前七种可能的实现方式,且能够达到上述所有的有益效果。
本发明实施例的第五方面提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行如第一方面或第一方面前七种可能的实现方式,且能够达到上述所有的有益效果。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对现有技术中以及本发明实施例描述中所需要使用的附图作简单地介绍。
图1是本发明实施例提供的一种语音理解方法;
图2是本发明实施例提供的另一种语音理解方法;
图3是本发明实施例提供的一种数据包的形态;
图4是本发明实施例提供的另一种数据包的形态;
图5是本发明实施例提供的另一种数据包的形态;
图6是本发明实施例提供的一种语义理解模型进行语义理解的方式;
图7是本发明实施例提供的一种语义理解模型的训练方法;
图8是本根据本发明实施例的示例方法的一种示例实现方式;
图9是本根据本发明实施例的示例方法的另一种示例实现方式;
图10是本发明实施例提供的一种语音理解系统;
图11是本发明实施例提供的一种语音理解装置;
图12是本发明实施例提供的一种语音理解设备。
具体实施方式
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本发明实施例可应用于各种通过语音输入控制应用执行相关功能的场景中。作为示例, 例如,用户通过语音输入控制智能终端设备上的应用软件执行相应功能,用户在室内通过语音输入控制家用电器执行相应功能,用户在车内通过语音输入控制座舱内的硬件装置的功能或多媒体系统的功能等。在前述种种场景中,一个语音理解系统通常会接入多种应用,以使用户可以通过一个语音理解系统实现对多种功能的控制。然而,不同垂域下的应用的接口具有很大的不同,即使是相同垂域下的应用,开发者设计的接口也会有差异,使得语音理解系统适配各种接入的应用,并实现语音理解成为一种较难的任务。另外,不同用户对于同一个指令可能有不同的表达方式,同一个用户在不同的应用场景下,对于同一个指令也可能有不同的表达方式,使得语音理解系统识别各种用户表达中对应的目标应用成为一种较难的任务。
为解决上述问题,本申请实施例提供了一种语音理解方法,该方法可应用于智能终端中,也可布置在云端处理器中,还可以应用于独立的语音理解设备中。该语音理解方法中,接入的应用首先向语音理解系统发送注册请求,注册请求中包含数据包,该数据包中包含该应用执行业务所需的意图信息和槽位信息。语音理解系统将应用的注册请求中的数据包储存在综合语义库中,并根据应用提供的数据包中的意图信息和槽位信息识别用户的语音输入中的意图和槽位值,使得语音理解系统能够灵活的适配各种接入的应用,具有适配周期短,成本低的优点。另外,语音理解模型支持用户的各种灵活表达方式,具有更强的语音理解能力。
请参阅图1,在本发明一个实施例中,提供一种语音理解方法100,该方法包含如下步骤。
步骤110:获取第一语音输入;
用户的语音输入通常表达一种意图,且该意图为指示目标应用执行与该意图对应的业务。本发明实施例支持符合用户习惯的任何自然语言表达,例如,当用户想要表达“预定机票”的意图时,支持用户采用诸如“帮我预定一张去北京的机票”等较为规范的表达方式,也支持用户采用诸如“我要去北京”等较为简易的表达方式,还支持用户采用诸如“预定机票”“订票”等关键词式的表达方式。当然,本发明实施例还包括其他方式的表达。
在一种实现方式中,步骤110还包括语音识别步骤,用于将用户的语音输入转化为文字,以便于后续处理。
步骤120:根据综合语义库中的意图信息,识别第一语音输入的意图,根据综合语义库中的槽位信息,识别第一语音输入中的槽位值。
综合语音库中的意图信息和槽位信息来源于应用的注册请求中的数据包,即,当应用与语音理解系统进行适配时,需要向语音理解系统发送注册请求,该注册请求中包括该应用的标识以及包含该应用执行业务所需的意图信息和槽位信息的数据包,数据包的具体形态将在下文进行详细介绍。
语音理解系统在接收到新接入的应用的注册请求后,将注册请求中的数据包保存在综合语义库中,供后续进行语义识别时调用。在一种实现方式中,综合语义库中可以包含一个或多个应用的数据包,其中应用与数据包可以为一一对应的关系,即一个应用对应与一个包含该应用执行业务所需得意图信息和槽位信息的数据包。在另一种实现方式中,综合语义库可以对接收到的应用的注册请求包中的意图和槽位信息进行一个整理和合并,例如, 同一类型的应用程序可能具有相同的意图,在综合语义库中可以将相同的意图合并存储,此时,综合语音库中不存在数据包的概念。
综合语义库处于一个不断更新的状态,每接入一个新的应用,都会将该新接入的应用对应的数据包保存在综合语义库中,即对综合语义库进行一次更新,使得综合语义库始终包含所有接入应用的数据包。
在一种实现方式中,根据综合语音库中的意图信息识别用户的第一语音输入中的意图信息,和根据综合语音库中的槽位信息识别用户的第一语音输入中的槽位值得步骤可由语义理解模型执行,由语义理解模型执行语义理解的具体过程将在下文进行详细介绍。
步骤130:根据第一语音输入的意图,确定第一语音输入目标应用。
在一种实现方式中,仅有一个应用包含从用户的第一语音输入中识别到的意图时,此时,可直接确定该应用为用户的第一语音输入的目标应用。在另一种实现方式中,一个语音理解系统可能同时接入多个执行相同或相似业务功能的应用,此时,同一个意图可能对应于多个目标应用。例如:“预定机票”的意图可能同时对应于携程、飞猪、美团等多个目标应用。此时,在一种实现方式中,可根据用户的使用频率对多个目标应用进行排序,在另一种实现方式中,可根据用户评分对多个目标应用进行排序,在另一种实现方式中,还可以综合考虑用户的使用频率和用户评分对多个目标应用进行排序。应当理解,其他可能的对目标应用进行排序的方式也是可能的。在对多个目标应用进行排序后,可选定排序在第一位的目标应用作为用户的第一语音指令所对应的目标应用,将用户的语音输入中的意图和槽位值发送给该目标应用,由该目标应用执行相应业务。另外,在确定目标应用时,也可以将用户的语音输入中包含的槽位值纳入考虑,在综合考虑用户的语音输入中的意图和槽位值的基础上,确定目标应用。
在一种实现方式中,若未能正确识别用户的语音输入中的意图信息,则输出语音理解失败的反馈。该反馈可以采用任何可能的方式,例如:可通过屏幕显示或语音播报的方式发出语音理解失败的警告,也可通过屏幕显示或语音播报的方式要求用户重新进行语音输入。应当理解,其他能够使用户知晓语音识别失败的反馈方式也是可能的。
可选的,在确定目标应用后,可显示目标应用的界面,在一种实现方式中,可以显示目标应用的主界面,供用户操作,在另一种实现方式中,可以显示对应于第一语音信息的意图和槽位值的界面,例如,当用户的意图为“预定机票”时,可显示所有符合条件的机票显示出来,供用户选择。
步骤140:将第一语音输入中的意图和槽位值传送给目标应用。
在识别出用户的语音指令中的意图和槽位值后,将用户的语音指令中的意图和槽位值传送给对应的目标应用。根据语音理解系统与应用布置方式的不同,可以有不同的传送方式。在一种实现方式中,语音理解系统与业务应用可能共用一个中央处理器(CPU,Central Processing Unit),例如,语音理解系统与所匹配的应用均在一个智能终端上(例如:Siri语音助手),此时,可通过程序调用实现意图和槽位值的传送。在另一种实现方式中,语音理解系统可布置在云侧,应用可布置在端侧,此时,可通过网络通信实现意图和槽位值的传送。在另一种实现方式中,语音理解系统与业务应用的功能可以分别由不同的处理器执行,例如,语音理解系统可以为一个单独的装置控制,用于对其他外围装置进行控制(例 如:小度人工智能助手等),此时,可通过网络通信实现意图和槽位值的传送。应当理解,其他语音理解系统与应用的布置方式也是可能的,于此对应的,其他能够实现将用户的语音指令中的意图和槽位值传送给对应的目标应用的实现方式也是可能的。
请参阅图2,在本发明一个实施例中,语音理解方法100还包含如下步骤。
步骤150:要求用户输入缺少的槽位值。
应用执行特定业务时,通常有一个或多个必须的槽位值,缺少必选槽位值将导致应用无法执行该特定业务。因此,在一种实现方式中,当用户的第一语音输入中缺乏必选槽位值时,要求用户继续输入缺少的槽位值。语音理解系统可以以任何可能的方式要求用户继续输入缺少的槽位值。在一种实现方式中,可以通过屏幕显示要求用户输入缺少的槽位值,例如:可以在屏幕上显示:请输入XX槽位的槽位值。在另一种实现方式中,可以通过语音播报的方式要求用户输入缺少的槽位值,例如:利用语音合成(TTS,Text to Speech)将要求用户输入缺少的槽位值的指令转换为语音输出进行播报。应当理解,其他能够实现要求用户输入缺少的槽位值得实现方式都是可能的。
步骤160:获取第二语音输入。
在一种实现方式中,用户的第二语音输入可以仅包含缺少的槽位值,例如,采用“北京”“上海”等关键词式表达补充关于“目的地”这个槽位的槽位值。在另一种实现方式中,用户的第二语音输入可以包含第一语音输入和第一语音输入中缺少的槽位值,例如,如果第一语音输入仅表达了“帮我预定明天的机票”,第二语音输入可以为“帮我预定明天去北京的机票”。
步骤170:将第二语音输入中的槽位值传送给目标应用。
步骤170中将第二语音输入中的槽位值传送给目标应用的实现方式与步骤140相同,为避免重复,此处不再赘述。
步骤180:接收目标应用执行对应业务的反馈。
根据语音理解系统与应用布置方式的不同,可以有不同的接收方式。在一种实现方式中,语音理解系统与业务应用可能共用一个中央处理器(CPU),例如,语音理解系统与所匹配的应用均在一个智能终端上(例如:Siri语音助手),此时,可通过程序调用的方式接收目标应用执行对应业务的反馈。在另一种实现方式中,语音理解系统可布置在云侧,应用可布置在端侧,此时,可通过网络通信接收目标应用执行对应业务的反馈。在另一种实现方式中,语音理解系统与业务应用的功能可以分别由不同的处理器执行,例如,语音理解系统可以为一个单独的装置控制,用于对其他外围装置进行控制(例如:小度人工智能助手等),此时,可通过网络通信接收目标应用执行对应业务的反馈。应当理解,其他语音理解系统与应用的布置方式也是可能的,于此对应的,其他能够实现接收目标应用执行对应业务的反馈的实现方式也是可能的。
步骤190:向用户反馈目标应用执行对应业务的执行结果。
语音理解系统可以以任何可能的方式向用户反馈目标应用执行对应业务的执行结果。在一种实现方式中,可以通过屏幕显示向用户反馈目标应用执行对应业务的执行结果,例如:可以在屏幕上显示:机票预定成功。在另一种实现方式中,可以通过语音播报的方式向用户反馈目标应用执行对应业务的执行结果,例如:利用语音合成将“机票预定成功” 的文字转换为语音输出进行播报。应当理解,其他能够实现向用户反馈目标应用执行对应业务的执行结果的实现方式都是可能的。
综合语义库中应用的意图信息和槽位信息来源于应用的注册请求中的数据包,应用提供的数据包中的意图信息和槽位信息,具体指意图的描述信息和槽位的描述信息。其中,意图的描述信息支持应用灵活的描述方式,在一种实现方式中,可以采用较为格式化的描述方式,例如,对于“预定机票”的意图的描述信息可以是“预定XX日期去往XX城市的机票”等描述方式。在另一种实现方式中,也可以采用关键词式的描述方式,例如,对于“预定机票”的意图的描述信息可以是“预定机票”等描述方式。槽位的描述信息也支持应用灵活的描述方式,在一种实现方式中,可以采用类似属性的描述方式,例如,对于“目的地”的槽位的描述信息可以是“一个城市的名称”等描述方式,在另一种实现方式中,也可以采用键词式的描述方式,例如,对于“目的地”的槽位的描述信息可以是“到达地”“目的地”等描述方式。应当理解,其他可能的关于意图和槽位的描述信息也是可以的。
应用执行特定业务所需的槽位信息,包括选择型槽位和填写型槽位,其中,选择型槽位是指槽位的取值是预先定义好的有限集合的槽位,填写型槽位是指槽位的取值不是预先定义好的槽位,其槽位值可能直接来自用户指令中的片段(抽取式),也可能由语义理解系统根据用户指令生成(生成式)。例如:选择型槽位可以为“判断机票是否为直达机票”,该选择型槽位对应的槽位值可以包括“是”和“否”两个可能的槽位值。填写型槽位可以为“机票的目的地”,该填写型槽位为抽取式的填写型槽位,槽位值可以为用户的语音输入中包含的“北京”、“上海”等地点名称。
在一种实施例中,应用提供的数据包中的意图信息和槽位信息可能有不同的储存形式。如图3所示,在一种实现方式中,应用提供的数据包中可以规定意图与槽位之间的对应关系。意图与槽位之间的对应关系可以为,一个意图对用于一个槽位包,该槽位包中包含应用执行该意图对应的业务所需的一个或多个槽位信息。进一步的,如图4所述,在一种实现方式中,一个意图对应的槽位包中可以包含一个或多个必选槽位的描述信息和一个或多个可选槽位的描述信息。
必选槽位信息是指,应用执行特定业务所必须的槽位信息,以“预定机票”的业务为例,执行该业务所必须的槽位可以包括“出发时间”“出发地”“目的地”等,在缺少必选槽位中的任何一个槽位值的情况下,目标应用均无法正常执行特定业务。可选槽位值是指应用执行特定业务时的附加槽位信息,以“预定机票”的业务为例,执行该业务的可选槽位可以为“机票的舱位”“座位喜好”等,在缺少可选槽位中的槽位值的情况下,目标应用仍可以正常执行特定业务。
当应用提供的数据包中的意图与槽位具有对应关系,即,一个意图对应于一个槽位包时,语音理解系统可以首先根据综合语义库中的意图信息识别用户的语音输入中的意图,在识别到用户的语音输入中的意图后,不需要再对数据包中的全部槽位信息进行判断,仅需要针对数据包中该意图对应的槽位信息,识别用户语音指令中是否包含对应于这些槽位信息的槽位值即可。
在识别到用户的语音输入中的意图和槽位值后,可以由语音理解系统判断用户的语音 输入中是否包含了全部的目标应用执行对应业务所需的必选槽位对应的槽位值。在一种实现方式中,当用户的语音输入中包含了全部的目标应用执行对应业务所需的必选槽位对应的槽位值时,可直接将用户的语音输入中的意图和槽位值传送给目标应用。在另一种实现方式中,当用户的语音输入未包含全部的目标应用执行对应业务所需的必选槽位对应的槽位值时,执行步骤150、步骤160和步骤170。
如图5所示,在另一种实现方式中,应用提供的数据包中意图与槽位之间无对应关系。应用提供的数据包中仅包含一个或多个意图信息和一个和多个槽位信息,根据提供的数据包无法确定该应用执行某一个意图对应的业务需要哪些槽位值。
当应用提供的数据包中的意图与槽位不具有对应关系时,可以首先根据综合语义库中的意图信息,识别用户的第一语音输入中的意图,由于应用提供的数据包中的意图与槽位之间没有对应关系,可根据该意图确定的目标应用的槽位包中的全部槽位信息,识别用户的第一语音输入中的槽位值。
由于应用提供的数据包中的意图与槽位不具有对应关系时,语音理解系统无法直接判断用户的语音输入中是否包含了目标应用执行意图对应的业务所需的全部必选槽位,在一种实现方式中,语音理解系统可将识别到的用户的语音输入中的意图和槽位信息传送给目标应用,由目标应用判断用户的语音输入中是否包含了目标应用执行意图对应的业务所需的全部必选槽位。若用户的语音输入中包含了目标应用执行意图对应的业务所需的全部必选槽位,则执行对应的业务应用。若用户的语音输入中未包含了目标应用执行意图对应的业务所需的全部必选槽位,则将缺少的槽位信息发送给语音理解系统,执行步骤150、步骤160和步骤170。
根据综合语义库中的意图信息,识别所述第一语音输入的意图,和根据综合语义库中的槽位信息,识别所述第一语音输入中的槽位值的操作具体由语义理解模型执行。即,将用户的语音输入对应的文字内容与综合语义库中的意图信息和槽位信息共同输入语义理解模型中,由语义理解模型识别用户的语音输入中的意图和槽位值,如图6所示。语义理解模型可采用,例如:双向注意力神经网络模型(BERT,Bidirectional Encoder Representations from Transformers)等现有的机器阅读理解(MRC,Machine Reading Comprehension)模型,其他能够实现语义理解功能的模型也是可能的。
在一种实现方式中,可逐一识别意图和槽位值,即每次仅将一个意图信息和用户的语音输入对应的文字信息共同输入到语义理解系统中,由语义理解系统判断该用户语音输入中是否包含对应的意图。以识别用户的语音输入中是否包含“预定机票”的意图为例,将应用提供的数据包中关于“预定机票”的意图的描述信息和用户语音输入对应的文字信息共同输入到语义理解系统中,若用户的语音输入中包含“预定机票”的意图,则模型的输出可以为“是”,若用户的语音输入中不包含“预定机票”的意图,则模型的输出可以为“否”。
或者每次仅将一个槽位信息和用户的语音输入对应的文字信息共同输入到语义理解系统中,由语义理解系统判断该用户语音输入中是否包含对应的槽位值。以识别用户的语音输入中是否包含“目的地”的对应的槽位值为例,将应用提供的数据包中关于“目的地”的槽位的描述信息和用户语音输入对应的文字信息共同输入到语义理解系统中,若用户的语音输入中包含“目的地”的对应的槽位值“北京”,则模型的输出可以为“北京”,若用 户的语音输入中不包含“目的地”的对应的槽位值,则模型的输出可以为“不包含对应槽位值”。
在运算能力允许的情况下也可以同时运行多个语义理解模型,实现多个意图或槽位值的并行识别。在识别用户的语音输入中的意图和槽位值时,在一种实现方式中,可以优选的先识别用户的语音输入中的意图,根据识别到的意图确定目标应用,再根据目标应用的数据包中的槽位信息,识别用户的语音输入中的槽位值,可以避免识别其他应用的数据包中的槽位信息的过程,避免了不必要的运算,能够节约运算时间,提高语音理解的效率。
为了实现语义理解功能,需要在先对语义理解模型进行训练。请参阅图7,在一个实施例中,对语义理解模型的训练过程包括如下步骤:
步骤710:收集训练数据。
在一种实现方式中,可以广泛收集与各种应用执行特定业务相关的用户指令。在一种实现方式中,用户指令可以尽可能地涵盖各种现有的或者将来可能与语义理解系统进行对接的应用,以使得语义理解系统能够适配尽可能多的应用。在另一种实现方式中,当原本的用户指令中没有包含特定的用户指令,导致语义理解系统无法与某个或某些业务应用进行适配时,可以进一步收集新的相关用户指令,对语义理解系统进行进一步的训练升级,以实现与业务应用的接口匹配。训练文本还可以尽可能地涵盖每一种用户指令的各种表达方式,以提高语义理解模型的泛化能力,使得语义理解模型能够识别用户的多种表达方式。
步骤720:标注训练数据。
在一种实现方式中,可以对收集到的海量用户指令进行标注,标注用户指令中的意图和槽位值。另外,可以采用不同的描述方式对用户指令中的意图和槽位值对应的槽位进行描述,以应对不同应用的数据包中对于意图和槽位不同的描述方式。
步骤730:用标注后的训练数据对语义理解模型进行训练。
在一种实现方式中,对于每一个用户指令,将意图的描述信息和用户指令组合成训练数据作为语义理解模型的输入,使用语义理解模型输出的意图以及标注的意图作为损失函数的输入计算损失值来更新模型。例如:将意图描述“预定机票”与标注了意图的用户指令“订一张从上海到北京的飞机票”共同输入到语义理解模型中,希望得到匹配的结果。
在另一种实现方式中,对于每条用户指令,将槽位的描述信息和用户指令组合成训练数据作为语义理解模型的输入,使用语义理解模型输出槽位值以及标注的槽位值作为损失函数的输入计算损失值来更新模型;例如:将槽位“目的地”的描述“航班目的地”与标注了槽位值的用户指令“订一张从上海到北京的高铁票”输入模型,希望得到“北京”的结果。
训练好的语义理解模型能够准确识别与训练数据相同或相似的用户指令中的意图和槽位值。训练好的语义理解模型还具备很强的泛化能力,能够识别不同的用户指令表达中的意图和槽位值。另外,训练好的语义理解模型还能够识别新的意图或槽位。例如,当前的语义理解模型的训练数据中并没有与“预定高铁票”相关的训练数据,但是训练数据中包含与它类似的“预定机票”的相关用户指令及意图的描述信息。此时,模型在训练的过程中已经具备了从用户指令中识别“预定机票”的意图和相关槽位值的能力。当新接入的应用的数据包中包含“预定机票”的意图信息和相关槽位信息时,模型大概率能够从用户指 令中成功识别该意图和相关的槽位值,实现与训练数据中不包含的新的应用的适配。
图8示例了根据本发明实施例的示例方法的一种示例实现方式,其中,目标应用为订票应用,所执行的业务为机票预定业务。请参阅图8,当订票应用接入语音理解系统后,进行步骤811,向语音理解系统发送接口注册请求,包含:订票应用的标识,执行业务对应的意图信息,执行业务所需的槽位值对应的槽位信息。例如,在本示例实施方式中,意图信息为“预定机票”,订票应用发送的注册请求中的数据包中,一个意图对应一个槽位包,即,“预定机票”的意图对应的槽位包中包含必选槽位信息“目的地”、“出发时间”、以及可选槽位信息“机舱类型”“座位喜好”。
语音理解系统接收到订票应用的注册请求后,进行步骤821,将注册请求中的数据包中得意图信息和槽位信息存储在综合语义库中,并进行步骤822对订票应用进行反馈,向订票应用发送注册成功的消息。至此,订票应用完成了与语音理解系统的适配过程。
当用户需要订票应用执行预定机票的业务时,进行步骤831,向语音理解系统发送自然语言指令。例如,在本示例实施方式中,用户发送的指令为:“帮我订一张直达去上海的机票”。
语音理解系统获取用户的语音输入后,进行步骤823,根据综合语义库中储存的意图信息和槽位信息,识别用户的语音输入中的意图和槽位值。例如,在本示例实施方式中,语义理解系统识别到用户的语音输入中包含“预定机票”的意图,包含“是否直达”的选择型槽位对应的槽位值“是”,包含“目的地”的填写型槽位对应的槽位值为“上海”。
在识别了用户的语音输入中的意图和槽位值后,进行步骤824,根据“预定机票”的意图,结合槽位值及用户的使用频率和用户评分等指标,确定订票应用为目标应用。
在确定了目标应用之后,进行步骤825,根据“预定机票”的意图在订票应用的数据包中对应的槽位包中的槽位信息,判断用户的语音输入中是否包含了全部的必选槽位对应的槽位值,即,是否包含了“目的地”和“出发时间”对应的槽位值。若用户的语音输入中包含了“目的地”和“出发时间”对应的槽位值,则进行步骤827。以“帮我订一张直达去上海的机票”的用户指令为例,缺少“出发时间”对应的槽位值,此时,执行步骤826,要求用户再次输入“出发时间”对应的槽位值。
在用户收到语音理解系统的反馈后,进行步骤832,通过语音输入“明天”、“后天”等关于日期的描述,或是直接输入诸如“2020年11月20日”某个具体日期,作为对缺少的槽位值的补充。另外,用户也可以重新语音输入完整的指令,如“帮我订一张明天直达去上海的机票”。
在获取到用户的语音输入中的意图和全部必选槽位对应的槽位值后,进行步骤827,语音理解系统将用户的语音输入中的意图和槽位值传送给订票软件。例如,在本示例实施方式中,语音理解系统将“预定机票”的意图和“直达”、“上海”的槽位值发送给订票应用。
订票应用接收到语音理解系统发送的意图和槽位值后,进行步骤812,根据接收到的意图和槽位值执行预定机票的业务。订票应用在执行完对应业务后,通过步骤813将执行状态反馈给语义理解系统,语义理解系统进而通过步骤828将执行状态反馈给用户。在本示例实施方式中,若预定机票成功,则将预定成功的结果进行反馈,否则,反馈预定失败 的原因。
图9示例了根据本发明实施例的示例方法的另一种示例实现方式,其中,目标应用为订票应用,所执行的业务为机票预定业务。该示例实现方式的大部分内容与图8相同,不同点在于订票应用发送给语音理解系统的注册请求中的数据包中,意图与槽位之间没有对应关系,因此,语音理解系统无法从数据包中得到订票应用执行“预定机票”的意图对应的业务时所需的必选槽位。
在确定了目标应用之后,进行步骤925,语音理解系统将用户的语音输入中的意图和槽位值传送给订票应用。在步骤912中,由订票应用判断用户的语音输入中是否包含了全部的必选槽位对应的槽位值,即,是否包含了“目的地”和“出发时间”对应的槽位值。若用户的语音输入中包含了“目的地”和“出发时间”对应的槽位值,则进行步骤914。以“帮我订一张直达去上海的机票”的用户指令为例,缺少“出发时间”对应的槽位值,此时,执行步骤913,将“出发时间”的槽位反馈给语音理解系统。由语音理解系统通过步骤926和步骤932与用户进行进一步交互,要求用户再次输入“出发时间”对应的槽位值,并将缺少的槽位值传送给订票应用。后续操作步骤与图8相同。
上述示例实施方式仅以订票软件为例,展示语音理解系统与应用进行适配及对用户的语音输入进行语音理解的工作流程,应当理解,其他常用的应用也是可能的。
在图1至图9所对应的实施例的基础上,为了更好的实施本申请实施例的上述方案,下面还提供用于实施上述方案的相关设备。
基于前述实施例相同的技术构思,请参阅图10,在本发明一个实施例中,提供一种语义理解系统1000,该语义理解系统包含如下模块:
通信模块1010,用于获取用户的第一语音输入,即,实现步骤110的内容。
通信模块1010,还用于接收应用的注册请求中的数据包。
数据存储模块1020,用于储存应用的数据包,该数据包来源于应用的注册请求,该数据包包含应用的意图信息和槽位信息。
语音理解系统在接收到新接入的应用的注册请求后,将注册请求中的数据包保存在综合语义库中,供后续进行语义识别时调用。具体的,综合语义库中包含一个或多个应用的数据包,其中应用与数据包可以为一一对应的关系,即一个应用对应与一个包含该应用执行业务所需得意图信息和槽位信息得数据包。综合语义库处于一个不断更新的状态,每接入一个新的应用,都会将该新接入的应用对应的数据包保存在综合语义库中,即对综合语义库进行一次更新,使得综合语义库始终包含所有接入应用的数据包。
处理模块1030,用于根据综合语义库中的意图信息,识别用户的第一语音输入的意图,根据综合语义库中的槽位信息,识别用户的第一语音输入中的槽位值,即实现步骤120的内容。
在一种实现方式中,根据综合语音库中的意图信息识别用户的第一语音输入中的意图信息,和根据综合语音库中的槽位信息识别用户的第一语音输入中的槽位值得步骤可由语义理解模型执行,具体的实现方法如图6所述,上述实施例中已经对语义理解模型的工作过程进行了详细的描述,为避免重复,这里不再进行赘述。
处理模块1030还用于,根据第一语音输入的意图,确定第一语音输入的目标应用。
在一种实现方式中,仅有一个应用包含从用户的第一语音输入中识别到的意图时,此时,可直接确定该应用为用户的第一语音输入的目标应用。在另一种实现方式中,一个语音理解系统可能同时接入多个执行相同或相似业务功能的应用,此时,同一个意图可能对应于多个目标应用。例如:“预定机票”的意图可能同时对应于携程、飞猪、美团等多个目标应用。此时,在一种实现方式中,可根据用户的使用频率对多个目标应用进行排序,在另一种实现方式中,可根据用户评分对多个目标应用进行排序,在另一种实现方式中,还可以综合考虑用户的使用频率和用户评分对多个目标应用进行排序。应当理解,其他可能的对目标应用进行排序的方式也是可能的。在对多个目标应用进行排序后,可选定排名第一的目标应用作为用户的第一语音指令所对应的目标应用,将用户的语音输入中的意图和槽位值发送给该目标应用,由该目标应用执行相应业务。另外,在确定目标应用时,也可以将用户的语音输入中包含的槽位值纳入考虑,在综合考虑用户的语音输入中的意图和槽位值的基础上,确定目标应用。
在一种实现方式中,若未能正确识别用户的语音输入中的意图信息,则输出语音理解失败的反馈。该反馈可以采用任何可能的方式,例如:可通过屏幕显示或语音播报的方式发出语音理解失败的警告,也可通过屏幕显示或语音播报的方式要求用户重新进行语音输入。应当理解,其他能够使用户知晓语音识别失败的反馈方式也是可能的。
通信模块1010还用于,将第一语音输入中的意图和槽位值传送给目标应用。
根据语音理解系统与应用布置方式的不同,可以有不同的传送方式。在一种实现方式中,语音理解系统与业务应用可能共用一个中央处理器(CPU,),例如,语音理解系统与所匹配的应用均在一个智能终端上(例如:Siri语音助手),此时,可通过程序调用实现意图和槽位值的传送。在另一种实现方式中,语音理解系统可布置在云侧,应用可布置在端侧,此时,可通过网络通信实现意图和槽位值的传送。在另一种实现方式中,语音理解系统与业务应用的功能可以分别由不同的处理器执行,例如,语音理解系统可以为一个单独的装置控制,用于对其他外围装置进行控制(例如:小度人工智能助手等),此时,可通过网络通信实现意图和槽位值的传送。应当理解,其他语音理解系统与应用的布置方式也是可能的,于此对应的,其他能够实现将用户的语音指令中的意图和槽位值传送给对应的目标应用的实现方式也是可能的。
在本发明的另一个实施例中,通信模块1010还用于:
执行步骤150,要求用户输入缺少的槽位值。
执行步骤160,获取第二语音输入。
执行步骤190,向用户反馈目标应用执行对应业务的执行结果。
执行步骤170,将第二语音输入中的槽位值传送给目标应用。
执行步骤180,接收目标应用执行对应业务的反馈。
步骤150至步骤190的具体实现方式在方法实施例中已经进行了详细的描述,为避免重复,此处不再进行赘述。
在一种实现方式中,语音理解系统与业务应用可以共用一个中央处理器(CPU),例如,语音理解系统与所匹配的应用安装在一个智能终端上(例如:Siri语音助手),可以通过语音理解系统实现对智能终端设备上的各种应用的功能进行控制,例如:播放音乐、接挂电 话、查询天气等。这里的智能终端可以是台式计算机、电视机、平板电脑、膝上型电脑、智能手机、电子阅读器、智能手表和智能眼镜等。在另一种实现方式中,语音理解系统可布置在云侧,云侧的语音理解系统可帮助一个或多个端侧设备执行语音理解的功能。当将语音理解系统布置在云端时,云端的语音理解系统主要执行语音理解的功能,而端侧的设备主要用于与目标应用进行交互和与用户进行交互。端侧设备将从注册应用接收到的意图信息和槽位信息发送给云侧设备,云侧设备将从一个或多个端侧设备接收到的意图信息和槽位信息进行储存。当端侧设备接收到语音信息时,可将语音信息发送给端侧设备,由端侧设备根据存储的意图信息和槽位信息识别语音信息的意图和语音信息种的槽位值,并将识别到的意图和槽位值发送给端侧设备,由端侧设备与目标应用进行交互,执行后续操作。当语音信息中缺少部分槽位信息时,由端侧设备请求用户输入缺少的槽位信息,并将用户重新输入的语音信息发送给云侧设备进行语音理解。在另一种实现方式中,语音理解系统可以为一个单独的控制装置,用于对其他外围设备进行控制(例如:小度人工智能助手等),可以通过语音理解装置实现对外围设备的功能进行控制。例如,语音理解装置可以位于室内,对室内的各种家用电器进行控制;语音理解系统可以位于车内,对座舱内的各种硬件系统进行控制等。应当理解,语音理解系统的其他的可能的实现方式也是可能的。
基于前述实施例相同的技术构思,请参阅图11,在本发明一个实施例中,提供一种语义理解装置1100,该语义理解装置包含如下模块:
输入输出设备1110,用于接收用户的第一语音输入。
在一种实现方式中,输入输出设备1110可以包含麦克风等语音输入设备,用于接收用户的第一语音输入,实现步骤110的内容。
处理器1120,用于接收应用的注册请求中的数据包。
存储器1130,用于储存应用的数据包,该数据包来源于应用的注册请求,该数据包包含应用的意图信息和槽位信息。
语音理解系统在接收到新接入的应用的注册请求后,将注册请求中的数据包保存在综合语义库中,供后续进行语义识别时调用。具体的,综合语义库中包含一个或多个应用的数据包,其中应用与数据包可以为一一对应的关系,即一个应用对应与一个包含该应用执行业务所需得意图信息和槽位信息得数据包。综合语义库处于一个不断更新的状态,每接入一个新的应用,都会将该新接入的应用对应的数据包保存在综合语义库中,即对综合语义库进行一次更新,使得综合语义库始终包含所有接入应用的数据包。
处理器1120,还用于根据综合语义库中的意图信息,识别用户的第一语音输入的意图,根据综合语义库中的槽位信息,识别用户的第一语音输入中的槽位值,具体的实现方法如图6所述,上述实施例中已经对语义理解模型的工作过程进行了详细的描述,为避免重复,这里不再进行赘述。
在一种实现方式中,根据综合语音库中的意图信息识别用户的第一语音输入中的意图信息,和根据综合语音库中的槽位信息识别用户的第一语音输入中的槽位值得步骤可由语义理解模型执行。
处理器1120还用于,根据用户的第一语音输入的意图,确定用户的第一语音输入的目标应用;
在一种实现方式中,仅有一个应用包含从用户的第一语音输入中识别到的意图时,此时,可直接确定该应用为用户的第一语音输入的目标应用。在另一种实现方式中,一个语音理解系统可能同时接入多个执行相同或相似业务功能的应用,此时,同一个意图可能对应于多个目标应用。例如:“预定机票”的意图可能同时对应于携程、飞猪、美团等多个目标应用。此时,在一种实现方式中,可根据用户的使用频率对多个目标应用进行排序,在另一种实现方式中,可根据用户评分对多个目标应用进行排序,在另一种实现方式中,还可以综合考虑用户的使用频率和用户评分对多个目标应用进行排序。应当理解,其他可能的对目标应用进行排序的方式也是可能的。在对多个目标应用进行排序后,可选定排名第一的目标应用作为用户的第一语音指令所对应的目标应用,将用户的语音输入中的意图和槽位值发送给该目标应用,由该目标应用执行相应业务。另外,在确定目标应用时,也可以将用户的语音输入中包含的槽位值纳入考虑,在综合考虑用户的语音输入中的意图和槽位值的基础上,确定目标应用。
在一种实现方式中,若未能正确识别用户的语音输入中的意图信息,则输出语音理解失败的反馈。该反馈可以采用任何可能的方式,例如:可通过屏幕显示或语音播报的方式发出语音理解失败的警告,也可通过屏幕显示或语音播报的方式要求用户重新进行语音输入。应当理解,其他能够使用户知晓语音识别失败的反馈方式也是可能的。
处理器1120还用于,将用户的第一语音输入中的意图和槽位值发送给对应的目标应用。
在本发明的另一个实施例中,输入输出设备1110还用于:
执行步骤160,获取第二语音输入。
执行步骤150,要求用户输入缺少的槽位值。
执行步骤190,向用户反馈目标应用执行对应业务的执行结果。
在一种实现方式中,执行步骤160和步骤190的人机交互接口可以包括喇叭等语音输出设备。语音理解系统首先生成文字指令,随后利用语音生成系统(TTS)将文字指令转化为语音,向用户播报。
处理器1120还用于:
执行步骤180,接收目标应用执行对应业务的反馈。
执行步骤170,将第二语音输入中的槽位值传送给目标应用。
步骤150至步骤190的具体实现方式在方法实施例中已经进行了详细的描述,为避免重复,此处不再进行赘述。
在一种实现方式中,语音理解装置可以为一个单独的控制装置,用于对其他外围设备进行控制(例如:小度人工智能助手等),可以通过语音理解装置实现对外围设备的功能进行控制。例如,语音理解装置可以位于室内,对室内的各种家用电器进行控制;语音理解系统可以位于车内,对座舱内的各种硬件系统进行控制等。应当理解,语音理解系统的其他的可能的实现方式也是可能的。
基于前述实施例相同的技术构思,请参阅图12,在本发明一个实施例中,提供一种语音理解设备1200。具体的,语音理解设备1200包括:存储器1210和处理器1220(其中语音理解设备1200中的处理器1220的数量可以是一个或多个,图12中以一个处理器为例)。存储器1210可以包括只读存储器和随机存取存储器,并向处理器1220提供指令和数据。
A part of the memory 1210 may further include a non-volatile random access memory (NVRAM). The memory 1210 stores operation instructions, executable modules, or data structures, or a subset thereof, or an extended set thereof, where the operation instructions may include various operation instructions for implementing various operations.
The methods disclosed in the foregoing embodiments of this application may be applied to the processor 1220, or implemented by the processor 1220. The processor 1220 may be an integrated circuit chip having a signal processing capability. During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 1220 or by instructions in the form of software. The processor 1220 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1220 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed in the embodiments of this application may be directly executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1210, and the processor 1220 reads the information in the memory 1210 and completes the steps of the foregoing methods in combination with its hardware.
An embodiment of the present invention further provides a computer-readable storage medium. From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by dedicated hardware including an application-specific integrated circuit, a dedicated CPU, a dedicated memory, dedicated components, and the like. Generally, any function completed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may also be various, for example, an analog circuit, a digital circuit, or a dedicated circuit. However, for this application, a software program implementation is the better implementation in most cases. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for instructing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of this application. When the computer program instructions in the storage medium corresponding to the speech understanding method are read or executed by an electronic device, the steps and processes described in the method embodiments shown in FIG. 1 to FIG. 7 can be implemented; details are not described here one by one.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are all or partially generated. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, through a coaxial cable, an optical fiber, or a digital subscriber line (DSL)) or a wireless manner (for example, through infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can store, or a data storage device, such as a server or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), a semiconductor medium (for example, a solid state disk (SSD)), or the like.
It can be understood that, in the several embodiments provided in this application, the disclosed methods, apparatuses, devices, computer storage media, and computer program products may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into modules is merely a logical function division, and there may be other division manners in actual implementation. For example, multiple modules may be combined or integrated into another system, or some features may be ignored or not performed. In one implementation, the apparatus is stored in a memory in the form of executable program modules, which are invoked and executed by a processor, so that the processor controls the modules in the semantic understanding apparatus to perform the corresponding operations, thereby implementing the interface matching operation between the semantic understanding system and a new business application.
It can be understood that the steps in the methods described in the embodiments of the present invention may be reordered, combined, and deleted according to actual needs. Correspondingly, the modules in the apparatuses described in the embodiments of the present invention may also be combined, divided, and deleted according to actual needs.
What is disclosed above is merely preferred embodiments of the present invention, which certainly cannot be used to limit the scope of rights of the present invention. A person of ordinary skill in the art can understand all or part of the processes for implementing the foregoing embodiments, and equivalent changes made according to the claims of the present invention still fall within the scope covered by the invention.

Claims (23)

  1. A speech understanding method, comprising:
    obtaining first voice information;
    matching an intent of the first voice information according to intent information in a comprehensive semantic library;
    determining a target application according to the intent of the first voice information;
    recognizing slot values in the first voice information according to slot information in the comprehensive semantic library, wherein the intent information and the slot information in the comprehensive semantic library come from a plurality of registered applications; and
    transmitting the intent of the first voice information and the slot values in the first voice information to the target application.
  2. The method according to claim 1, further comprising:
    displaying a first interface of the target application according to the intent of the first voice information and the slot values in the first voice information.
  3. The method according to claim 1 or 2, wherein after the transmitting the intent of the first voice information and the slot values in the first voice information to the target application, the method further comprises:
    outputting a result of the target application executing a target operation, wherein the target operation is an operation determined by the intent of the first voice information and the slot values in the first voice information.
  4. The method according to any one of claims 1 to 3, wherein the determining a target application according to the intent of the first voice information specifically comprises:
    determining a plurality of candidate applications according to the intent of the first voice information; and
    determining the target application according to respective user usage frequencies or ratings of the plurality of candidate applications.
  5. The method according to any one of claims 1 to 4, further comprising:
    receiving feedback information of the target application, wherein the feedback information comprises slot information missing from the first voice information; and
    requesting a user to input the slot information missing from the first voice information.
  6. The method according to any one of claims 1 to 5, wherein:
    one piece of the intent information in the comprehensive semantic library corresponds to at least one slot package, and one slot package comprises one or more pieces of the slot information; and
    the recognizing slot values in the first voice information according to slot information in the comprehensive semantic library specifically comprises:
    recognizing the slot values in the first voice information according to the slot information in the slot package corresponding to the intent of the first voice information.
  7. The method according to claim 6, wherein the one piece of the intent information in the comprehensive semantic library corresponding to at least one slot package, the one slot package comprising one or more pieces of the slot information, specifically means that:
    the one slot package contains one or more pieces of mandatory slot information, wherein the mandatory slot information is slot information necessary for executing the intent corresponding to the intent information; and
    when the number of slot values recognized from the first voice information is less than the number of pieces of mandatory slot information in a target slot package, a user is requested to input the slot information missing from the first voice information, wherein the target slot package is the slot package corresponding to the intent of the first voice information.
  8. The method according to claim 5 or 7, wherein after the requesting a user to input the slot information missing from the first voice information, the method further comprises:
    obtaining second voice information in response to the request;
    recognizing slot values in the second voice information according to the missing slot information; and
    transmitting the slot values in the second voice information to the target application.
  9. A speech understanding device, comprising:
    a communication module, configured to receive first voice information;
    a data storage module, configured to store intent information and slot information, wherein the intent information and the slot information come from a plurality of registered applications; and
    a processing module, configured to match an intent of the first voice information according to the intent information in the data storage module, determine a target application according to the intent of the first voice information, and recognize slot values in the first voice information according to the slot information in the data storage module;
    wherein the communication module is further configured to send the intent of the first voice information and the slot values in the first voice information.
  10. The speech understanding device according to claim 9, wherein the processing module determining a target application according to the intent of the first voice information specifically means that:
    the processing module determines a plurality of candidate applications according to the intent of the first voice information, and determines the target application according to respective user usage frequencies or ratings of the plurality of candidate applications.
  11. The speech understanding device according to claim 9 or 10, wherein:
    one piece of the intent information corresponds to at least one slot package, and one slot package comprises one or more pieces of the slot information; and
    the recognizing slot values in the first voice information according to the slot information specifically comprises:
    recognizing the slot values in the first voice information according to the slot information in the slot package corresponding to the intent of the first voice information.
  12. The speech understanding device according to claim 11, wherein the one piece of the intent information corresponding to at least one slot package, the one slot package comprising one or more pieces of the slot information, specifically means that:
    the one slot package contains one or more pieces of mandatory slot information, wherein the mandatory slot information is slot information necessary for executing the intent corresponding to the intent information; and
    when the number of slot values recognized from the first voice information is less than the number of pieces of mandatory slot information in a target slot package, the communication module is further configured to send the slot information missing from the first voice information, wherein the target slot package is the slot package corresponding to the intent of the first voice information.
  13. The speech understanding device according to claim 12, wherein after the communication module sends the slot information missing from the first voice information to a second device, the communication module is further configured to obtain second voice information;
    the processing module is further configured to recognize slot values in the second voice information according to the missing slot information; and
    the communication module is further configured to send the slot values in the second voice information.
  14. A speech understanding apparatus, comprising:
    a microphone, configured to collect first voice information;
    a memory, configured to store intent information and slot information, wherein the intent information and the slot information come from a plurality of registered applications; and
    a processor, configured to match an intent of the first voice information according to the intent information, determine a target application according to the intent of the first voice information, and recognize slot values in the first voice information according to the slot information;
    wherein the processor is further configured to transmit the intent of the first voice information and the slot values in the first voice information to the target application.
  15. The speech understanding apparatus according to claim 14, further comprising a display screen;
    wherein the processor is configured to instruct, according to the intent of the first voice information and the slot values in the first voice information, the display screen to display a first interface of the target application.
  16. The speech understanding apparatus according to claim 14 or 15, further comprising an output device, configured to, after the processor transmits the intent of the first voice information and the slot values in the first voice information to the target application, output, as instructed by the processor, a result of the target application executing a target operation, wherein the target operation is an operation determined by the intent of the first voice information and the slot values in the first voice information.
  17. The speech understanding apparatus according to any one of claims 14 to 16, wherein the processor determining a target application according to the intent of the first voice information specifically means that:
    the processor determines a plurality of candidate applications according to the intent of the first voice information; and
    the processor determines the target application according to respective user usage frequencies or ratings of the plurality of candidate applications.
  18. The speech understanding apparatus according to any one of claims 14, 15, and 17, wherein:
    the processor is further configured to receive feedback information of the target application, wherein the feedback information comprises slot information missing from the first voice information; and
    the speech understanding apparatus further comprises an output device, configured to output a first request, wherein the first request is used to request a user to input the slot information missing from the first voice information.
  19. The speech understanding apparatus according to any one of claims 14 to 18, wherein:
    one piece of the intent information corresponds to at least one slot package, and one slot package comprises one or more pieces of the slot information; and
    the recognizing slot values in the first voice information according to the slot information specifically comprises:
    recognizing the slot values in the first voice information according to the slot information in the slot package corresponding to the intent of the first voice information.
  20. The speech understanding apparatus according to claim 19, wherein the one piece of the intent information corresponding to at least one slot package, the one slot package comprising one or more pieces of the slot information, specifically means that:
    the one slot package contains one or more pieces of mandatory slot information, wherein the mandatory slot information is slot information necessary for executing the intent corresponding to the intent information; and
    the speech understanding apparatus further comprises an output device, wherein when the number of slot values recognized from the first voice information is less than the number of pieces of mandatory slot information in a target slot package, the output device is configured to request a user to input the slot information missing from the first voice information, the target slot package being the slot package corresponding to the intent of the first voice information.
  21. The speech understanding apparatus according to claim 18 or 20, wherein:
    after the requesting a user to input the slot information missing from the first voice information, the microphone is further configured to collect second voice information;
    the processor is further configured to recognize slot values in the second voice information according to the missing slot information; and
    the processor is further configured to transmit the slot values in the second voice information to the target application.
  22. A computer storage medium, comprising computer instructions, wherein when the computer instructions are run on a computer, the computer is caused to perform the method according to any one of claims 1 to 8.
  23. A computer program product, wherein when the computer program product is run on a computer, the computer is caused to perform the method according to any one of claims 1 to 8.