WO2023040658A1 - Voice interaction method and electronic device - Google Patents

Voice interaction method and electronic device

Info

Publication number
WO2023040658A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
instruction
voice command
user
command
Prior art date
Application number
PCT/CN2022/115934
Other languages
English (en)
French (fr)
Inventor
潘邵武
甘嘉栋
徐传飞
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2023040658A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 1/00: Substation equipment, e.g. for use by subscribers
    • H04M 1/72: Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M 1/724: User interfaces specially adapted for cordless or mobile telephones
    • H04M 1/72403: User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04M: TELEPHONIC COMMUNICATION
    • H04M 3/00: Automatic or semi-automatic exchanges
    • H04M 3/42: Systems providing special services or facilities to subscribers
    • H04M 3/487: Arrangements for providing information services, e.g. recorded voice services or time announcements
    • H04M 3/493: Interactive information services, e.g. directory enquiries; Arrangements therefor, e.g. interactive voice response [IVR] systems or voice portals

Description

  • the embodiments of the present application relate to the field of artificial intelligence (AI), and in particular, to a voice interaction method and an electronic device.
  • With voice assistants such as Siri, Xiao Ai, and Xiao E, users can trigger the voice assistant to open target applications, play music, and inquire about the weather.
  • the voice interaction function provided by the voice assistant requires the voice assistant to accurately recognize the user's voice commands in order to perform the operation the user wants.
  • The voice commands that a voice assistant can recognize are usually limited to those that its internal speech recognition algorithms or models support after training. However, the voice commands issued by the user may differ from the voice commands that the voice assistant can recognize, which easily leads to failed voice interaction and a poor user experience.
  • the embodiment of the present application discloses a voice interaction method and an electronic device, which can improve the ability to recognize voice commands and improve user experience.
  • the embodiment of the present application provides a voice interaction method, which can be applied to voice assistants or electronic devices.
  • The method includes: receiving a first voice command that cannot be effectively recognized; receiving a second voice command and establishing an association relationship between the second voice command and the first voice command, where the second voice command corresponds to a first response; and receiving a third voice command whose content or pronunciation is the same as that of the first voice command and, in response to the third voice command, performing the same first response as for the second voice command.
  • The above-mentioned failure to effectively recognize the first voice instruction includes failing to recognize the semantics (that is, the intent) of the first voice instruction. For example, if the first voice instruction is "Biao ge ge ba", the voice assistant on the electronic device cannot recognize that the intent of the first voice instruction is "play music", and the electronic device cannot perform the corresponding operation.
  • Failing to effectively recognize the first voice instruction also includes wrongly recognizing its semantics (that is, the intent). For example, the first voice instruction is "Biao ge ge ba" and its corresponding intent is "play music", but the voice assistant on the electronic device mistakenly recognizes the intent of the first voice instruction as "turn on the light", and the electronic device executes the operation of turning on the light.
  • The above-mentioned first response is a first operation. For example, the first voice instruction "Biao ge ge ba" corresponds to "play music", so its corresponding first response is playing music. The electronic device executing the first response means the electronic device performing the first operation: the voice assistant of the electronic device performs speech recognition and semantic understanding on the user's voice command, determines that the user intends to play music, and obtains an instruction for playing music; the instruction is executed, and the electronic device plays music in response to it.
  • When the voice assistant or the electronic device cannot recognize, or wrongly recognizes, the user intention corresponding to the first voice command output by the user, it continues to use a sound pickup device (such as a microphone) to collect the user's voice and receive a second voice command, input by the user to explain or restate the first voice command, that can be effectively recognized; that is, the user intention of the second voice command can be correctly recognized, and the first response corresponding to the second voice command is executed. An association relationship between the second voice instruction and the first voice instruction is established, so that a later third voice instruction having the same content or pronunciation as the first voice instruction can be identified according to the association relationship, and the same first response as for the second voice instruction can be executed.
  • Receiving the second voice instruction includes: when a recognition result of the first voice instruction cannot be generated, establishing a learning session associated with the first voice instruction; and receiving the second voice instruction during the learning session.
  • In the prior art, when the first voice command output by the user cannot be recognized, the voice assistant ends the processing of the first voice command and terminates the interaction process.
  • Likewise, when a prior-art voice assistant misrecognizes the user intention corresponding to the first voice command, the user cannot correct the voice assistant's recognition of the real user intention of the first voice command.
  • In contrast, the learning session associated with the first voice instruction provides the user with an explanation process: the user can continue to interact with the electronic device by voice, so as to express the content of the previous first voice instruction to the electronic device or the voice assistant again using a different wording.
  • In other words, the user expresses or explains the semantics of the previous first voice command to the electronic device or the voice assistant, so that the voice assistant can understand the user intention corresponding to the first voice command.
  • Establishing the association relationship between the first voice instruction and the second voice instruction includes: detecting that the second voice instruction is received during the learning session, and establishing the association relationship between the first voice instruction and the second voice instruction.
  • In this way, the received second voice instruction can by default be treated as explaining the first voice instruction, and the association relationship between the first voice instruction and the second voice instruction can be established directly, which improves voice interaction efficiency.
  • Alternatively, establishing the association relationship between the first voice command and the second voice command includes: detecting a trigger command; and, when the trigger command is detected, associating the second voice command received during the learning session with the first voice command.
  • In this way, the association relationship with the first voice command is established only after the trigger command is received, which improves the accuracy of the association.
  • Receiving the second voice instruction includes: during the learning session, receiving voice input, where the voice input includes the second voice instruction and first voice content; establishing the association relationship between the first voice instruction and the second voice instruction then includes: establishing the association relationship between the second voice instruction and the first voice instruction when the first voice content is detected.
  • The above-mentioned first voice content may be a preset template, such as "what I mean is" or "what I want to express is". When the voice assistant recognizes "what I mean is" or "what I want to express is", it considers that the voice input includes a second voice instruction for explaining the first voice instruction.
  • The voice content after the preset template can be used as the second voice command.
  • The above-mentioned first voice content can also be expressed in more flexible language, such as "No, it should be" or "That's not what it means, it is". That is, without template matching, the voice assistant can determine whether the voice input contains a second voice instruction for interpreting the first voice instruction by recognizing the user intention of the first voice content.
  • In this way, during voice interaction the user can explain a first voice command that the electronic device or the voice assistant did not effectively recognize, which improves the user experience and makes the interaction more intelligent and humanized.
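  • The following is a minimal Python sketch, not part of the disclosed embodiments, of how preset templates for the first voice content could be detected so that the text after the template is treated as the second voice instruction; the template strings and function names are illustrative assumptions.

        import re

        # Illustrative preset templates for the "first voice content"; the embodiments
        # describe phrases such as "what I mean is".
        EXPLANATION_TEMPLATES = [
            r"what i mean is",
            r"what i want to express is",
        ]

        def split_explanation(voice_input: str):
            """If a preset template is found, return the text after it as the second
            voice instruction used to explain the first voice instruction; otherwise
            return None."""
            lowered = voice_input.lower()
            for template in EXPLANATION_TEMPLATES:
                match = re.search(template, lowered)
                if match:
                    return voice_input[match.end():].strip(" ,.")
            return None

        print(split_explanation("What I mean is play music"))  # -> "play music"
        print(split_explanation("play music"))                 # -> None
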
  • Before receiving the second voice instruction, the method further includes: outputting feedback for the first voice instruction. In this way, the user is guided to continue the interaction by the output feedback, which helps the user understand and use the voice interaction method provided by the embodiments of the present application.
  • Before receiving the second voice instruction, the method further includes: performing a second response in response to the first voice instruction, where the second response is different from the first response.
  • The above-mentioned second response is the operation performed by the voice assistant after misidentifying the user's intention.
  • The second response may also include outputting feedback after the operation is performed.
  • For example, if the voice assistant on the electronic device mistakenly recognizes the intent of the first voice instruction as "turn on the light", the second response includes the electronic device turning on the light and giving the user the feedback "OK, the light is on".
  • Establishing an association relationship between the first voice instruction and the second voice instruction includes: detecting a trigger instruction; and, when the trigger instruction is detected, establishing the association relationship between the second voice instruction and the first voice instruction.
  • In this way, after the voice assistant makes a recognition error, the user can actively trigger and inform the voice assistant of the recognition error, so that the association between the second voice command and the first voice command is established and the voice assistant is guided to recognize the command correctly.
  • Receiving the second voice instruction includes: receiving the user's voice input, where the voice input includes the second voice instruction and second voice content used to indicate that there is an error in the recognition result of the first voice instruction.
  • Establishing an association relationship between the first voice instruction and the second voice instruction includes: establishing the association relationship between the second voice instruction and the first voice instruction when the second voice content is detected.
  • The above-mentioned second voice content may be a preset template, such as "what I mean is" or "what I want to express is". When the voice assistant recognizes "what I mean is" or "what I want to express is", it considers that the voice input includes a second voice instruction for modifying the first voice instruction.
  • The voice content after the preset template can be used as the second voice command.
  • The above-mentioned second voice content can also be expressed in more flexible language, such as "No, it should be" or "That's not what it means, it is". That is, without template matching, the voice assistant can determine whether the voice input contains a second voice instruction for modifying the first voice instruction by recognizing the user intention of the second voice content.
  • In this way, during voice interaction the user can correct a first voice command that the electronic device or the voice assistant recognized incorrectly, which improves the user experience and makes the interaction more intelligent and humanized.
  • Establishing the association relationship between the first voice instruction and the second voice instruction includes: equating the first voice instruction with the second voice instruction, or associating the first response with the first voice instruction. That is, the content of the first voice command, such as "Biao ge ge ba", is directly equated with the second voice command "play music", or the first response of the second voice command "play music" is associated with the first voice command "Biao ge ge ba", as illustrated by the sketch below.
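  • As an illustration only (the class and function names below are assumptions, not part of the claims), the following Python sketch shows one way such an association could be stored and later used, so that a third voice instruction identical to the first one resolves to the recognizable second voice instruction and therefore to the same first response.

        from typing import Optional

        class AssociationTable:
            """Illustrative store mapping an unrecognized utterance to the
            recognizable instruction the user explained it with."""

            def __init__(self):
                self._alias_to_standard = {}

            def associate(self, first_instruction: str, second_instruction: str) -> None:
                # Equate the first voice instruction (e.g. "Biao ge ge ba") with the
                # second, recognizable voice instruction (e.g. "play music").
                self._alias_to_standard[first_instruction] = second_instruction

            def resolve(self, instruction: str) -> str:
                # A later instruction with the same content resolves to the associated
                # standard instruction; anything else passes through unchanged.
                return self._alias_to_standard.get(instruction, instruction)

        def recognize_intent(instruction: str) -> Optional[str]:
            """Stand-in for the assistant's ASR/NLU pipeline over standard commands."""
            known = {"play music": "PLAY_MUSIC", "turn on the light": "TURN_ON_LIGHT"}
            return known.get(instruction)

        table = AssociationTable()
        print(recognize_intent("Biao ge ge ba"))                 # None: not recognized
        table.associate("Biao ge ge ba", "play music")           # user explains it
        print(recognize_intent(table.resolve("Biao ge ge ba")))  # PLAY_MUSIC
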
  • The voice interaction method further includes: generating a training data set according to the association relationship; and using the training data set to train the model of the voice assistant, so that the voice assistant can process voice commands adapted to the user's language habits.
  • In this way, based on the association relationship, more training data about the user's language habits can be generated, and this training data is used to train the voice assistant so that it can process voice instructions adapted to the user's language habits.
  • Generating the training data set according to the association relationship includes: uploading the association relationship to a cloud server; the cloud server receives the association relationships uploaded by a group of users to generate a training data set adapted to the language habits of the group of users.
  • In this way, based on these association relationships, more training data that conforms to the language habits of the group of users can be generated and used to train the voice assistant, so that the voice assistant can process voice commands adapted to the language habits of the group of users, as sketched below.
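  • Purely as a sketch (the data layout and the min_count threshold are assumptions, not part of the embodiments), the following Python snippet shows how association relationships collected from a user, or uploaded by group users to a cloud server, could be turned into (utterance, intent) training samples for the voice assistant's models.

        from collections import Counter

        # Association relationships: each entry pairs an initially unrecognized
        # utterance with the intent of the explaining (second) voice instruction.
        associations = [
            {"utterance": "Biao ge ge ba", "intent": "PLAY_MUSIC"},
            {"utterance": "music walk up", "intent": "PLAY_MUSIC"},
            {"utterance": "call the old man", "intent": "CALL_CONTACT_DAD"},
            {"utterance": "Biao ge ge ba", "intent": "PLAY_MUSIC"},
        ]

        def build_training_set(records, min_count=1):
            """Deduplicate uploaded associations and keep pairs seen at least
            min_count times, yielding (text, label) samples for training."""
            counts = Counter((r["utterance"], r["intent"]) for r in records)
            return [pair for pair, n in counts.items() if n >= min_count]

        for text, label in build_training_set(associations, min_count=1):
            print(text, "->", label)
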
  • The voice interaction method further includes: receiving a fourth voice instruction, where the content or pronunciation of the fourth voice instruction is not exactly the same as that of the first voice instruction but the similarity of the fourth voice instruction to the first voice instruction in content or pronunciation is within a first range; and, in response to the fourth voice instruction, performing the same first response as for the second voice instruction.
  • The first range may be the range within which, based on the robustness of the voice assistant, the fourth voice instruction is treated as substantially the same as the first voice instruction, or it may be a similarity range within which the voice assistant determines that the fourth voice instruction is similar to the first voice instruction.
  • For example, the fourth voice instruction "Biao ge ba" may be recognized as substantially the same voice instruction as the first voice instruction "Biao ge ge ba" (see the sketch below).
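  • The following Python sketch illustrates one possible way, assumed here and not claimed by the application, to decide whether a fourth voice instruction falls within the first range: a text-level similarity ratio with an illustrative threshold; a real assistant might instead compare phoneme sequences or acoustic features.

        from difflib import SequenceMatcher

        FIRST_RANGE_THRESHOLD = 0.8   # illustrative bound for the "first range"

        def similarity(a: str, b: str) -> float:
            # Character-level similarity as a stand-in for the assistant's own measure.
            return SequenceMatcher(None, a.lower(), b.lower()).ratio()

        def within_first_range(fourth: str, first: str) -> bool:
            return similarity(fourth, first) >= FIRST_RANGE_THRESHOLD

        print(within_first_range("Biao ge ba", "Biao ge ge ba"))       # True: treat as the same
        print(within_first_range("turn off the tv", "Biao ge ge ba"))  # False
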
  • A third voice instruction or a fourth voice instruction may be received; alternatively, both the third voice command and the fourth voice command may be received, for example, the third voice command is received before the fourth voice command, or the fourth voice command is received before the third voice command.
  • an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium contains computer-executable instructions for performing any one of the above methods.
  • an embodiment of the present application provides a system, and the system includes: the computer-readable storage medium provided in the second aspect; and a processor capable of executing computer-executable instructions.
  • An embodiment of the present application provides an electronic device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, where, when the program is executed by the processor, the electronic device executes any one of the above methods.
  • FIG. 1 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 2 is a schematic structural diagram of a voice assistant provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a wake-up scene of a voice assistant provided by an embodiment of the present application.
  • FIGS. 5(a)-5(b) are schematic diagrams of a first voice command interaction scenario provided by an embodiment of the present application.
  • FIGS. 6(a)-6(d) are schematic diagrams of scenarios of feedback to the first voice instruction provided by an embodiment of the present application.
  • FIGS. 7(a)-7(c) are schematic diagrams of a second voice command interaction scenario provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a scene of feedback to a second voice instruction provided by an embodiment of the present application.
  • FIGS. 9(a)-9(b) are schematic diagrams of a third voice command interaction scenario provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of an application scenario of an association table provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another application scenario of an association table provided by the embodiment of the present application.
  • FIG. 12 is a schematic diagram of another application scenario of an association table provided by the embodiment of the present application.
  • FIG. 13 is a schematic diagram of another application scenario of an association table provided by the embodiment of the present application.
  • FIG. 14 is a schematic diagram of another application scenario of an association table provided by the embodiment of the present application.
  • The terms "first" and "second" are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly specifying the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of these features. In the description of the embodiments, unless otherwise specified, "plurality" means two or more.
  • Words such as "exemplarily", "for example" or "in some examples" are used to represent examples, instances, or illustrations. Any embodiment or design solution described as "exemplary" or "for example" in the embodiments of the present application shall not be interpreted as being more preferred or more advantageous than other embodiments or design solutions. Rather, the use of words such as "exemplarily", "for example", or "in some examples" is intended to present related concepts in a concrete manner.
  • If the voice command issued by the user is a voice command that the voice assistant cannot recognize, the user experience is poor.
  • For this reason, an embodiment of the present application provides a voice interaction method.
  • When the voice assistant cannot recognize a voice command output by the user, the user can use a voice command that the voice assistant can recognize to explain the unrecognizable voice command, and based on this explanation the voice assistant improves its recognition of voice commands that it previously could not recognize, realizing automatic adaptation of the voice assistant and expanding its voice command recognition capability.
  • The electronic device in the embodiments of the present application may be a portable computer (such as a mobile phone), a notebook computer, a personal computer (PC), a wearable electronic device (such as a smart watch), a tablet computer, a smart home device, an augmented reality (AR)/virtual reality (VR) device, an artificial intelligence (AI) terminal (such as an intelligent robot), a vehicle-mounted computer, and the like.
  • FIG. 1 shows a schematic structural diagram of an electronic device.
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100 .
  • the electronic device 100 may include more or fewer components than shown in the figure, or combine certain components, or separate certain components, or arrange different components.
  • the illustrated components can be realized in hardware, software or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processing unit (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), controller, video codec, digital signal processor (digital signal processor, DSP), baseband processor, and/or neural network processor (neural-network processing unit, NPU), etc. Wherein, different processing units may be independent devices, or may be integrated in one or more processors.
  • the DSP can monitor the voice data in real time, and when the similarity between the voice data monitored by the DSP and the wake-up word registered in the electronic device satisfies a preset condition, the voice data can be handed over to the AP.
  • the AP performs text verification and voiceprint verification on the above voice data.
  • When the verification passes, the electronic device can start the voice assistant.
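  • A minimal sketch of this two-stage flow is given below; it is only an illustration in which audio is represented by a transcript string, and the wake-up word, voiceprint identifier, and threshold are assumed values rather than values from the embodiments.

        from difflib import SequenceMatcher

        WAKE_WORD = "xiaoyi xiaoyi"          # registered wake-up word (illustrative)
        REGISTERED_VOICEPRINT = "user-42"    # stand-in for an enrolled voiceprint

        def dsp_prescreen(transcript: str, threshold: float = 0.6) -> bool:
            """Low-power first stage: the DSP hands the audio to the AP only when it
            is roughly similar to the registered wake-up word."""
            return SequenceMatcher(None, transcript.lower(), WAKE_WORD).ratio() >= threshold

        def ap_verify(transcript: str, speaker_id: str) -> bool:
            """Second stage on the AP: stricter text verification plus voiceprint check."""
            return transcript.lower() == WAKE_WORD and speaker_id == REGISTERED_VOICEPRINT

        def handle_audio(transcript: str, speaker_id: str) -> str:
            if dsp_prescreen(transcript) and ap_verify(transcript, speaker_id):
                return "start voice assistant"
            return "keep listening"

        print(handle_audio("Xiaoyi Xiaoyi", "user-42"))  # start voice assistant
        print(handle_audio("hello there", "user-42"))    # keep listening
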
  • the controller can generate operation control signals according to instruction opcodes and timing signals, and complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 for storing instructions and data.
  • the memory in processor 110 is a cache memory.
  • The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, they can be called directly from the memory, which avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
  • the charging management module 140 is configured to receive a charging input from a charger.
  • the charger may be a wireless charger or a wired charger.
  • the charging management module 140 can receive charging input from the wired charger through the USB interface 130 .
  • the charging management module 140 may receive a wireless charging input through a wireless charging coil of the electronic device 100 . While the charging management module 140 is charging the battery 142 , it can also provide power for electronic devices through the power management module 141 .
  • the power management module 141 is used for connecting the battery 142 , the charging management module 140 and the processor 110 .
  • the power management module 141 can receive input from the battery 142 and/or the charging management module 140 to provide power for the processor 110 , the internal memory 121 , the display screen 194 , the camera 193 , and the wireless communication module 160 .
  • the power management module 141 can be used to monitor performance parameters such as battery capacity, battery cycle times, battery charging voltage, battery discharging voltage, battery health status (such as leakage, impedance). In some other embodiments, the power management module 141 may also be disposed in the processor 110 . In some other embodiments, the power management module 141 and the charging management module 140 may also be set in the same device.
  • the wireless communication function of the electronic device 100 can be realized by the antenna 1 , the antenna 2 , the mobile communication module 150 , the wireless communication module 160 , a modem processor, a baseband processor, and the like.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover single or multiple communication frequency bands. Different antennas can also be multiplexed to improve the utilization of the antennas.
  • Antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna may be used in conjunction with a tuning switch.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include one or more filters, switches, power amplifiers, low noise amplifiers (low noise amplifier, LNA) and the like.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, filter and amplify the received electromagnetic waves, and send them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signals modulated by the modem processor, and convert them into electromagnetic waves through the antenna 1 for radiation.
  • at least part of the functional modules of the mobile communication module 150 may be set in the processor 110 .
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be set in the same device.
  • a modem processor may include a modulator and a demodulator.
  • the modulator is used for modulating the low-frequency baseband signal to be transmitted into a medium-high frequency signal.
  • the demodulator is used to demodulate the received electromagnetic wave signal into a low frequency baseband signal. Then the demodulator sends the demodulated low-frequency baseband signal to the baseband processor for processing.
  • the low-frequency baseband signal is passed to the application processor after being processed by the baseband processor.
  • the application processor outputs sound signals through audio equipment (not limited to speaker 170A, receiver 170B, etc.), or displays images or videos through display screen 194 .
  • the modem processor may be a stand-alone device.
  • the modem processor may be independent from the processor 110, and be set in the same device as the mobile communication module 150 or other functional modules.
  • The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including a wireless local area network (WLAN) (such as a wireless fidelity (Wi-Fi) network), Bluetooth (BT), a global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating one or more communication processing modules.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2 , frequency-modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110 .
  • the wireless communication module 160 can also receive the signal to be sent from the processor 110 , frequency-modulate it, amplify it, and convert it into electromagnetic waves through the antenna 2 for radiation.
  • the antenna 1 of the electronic device 100 is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technologies, and the like.
  • The GNSS may include a global positioning system (GPS), a global navigation satellite system (GLONASS), a Beidou navigation satellite system (BDS), a quasi-zenith satellite system (QZSS), and/or a satellite based augmentation system (SBAS).
  • the electronic device 100 realizes the display function through the GPU, the display screen 194 , and the application processor.
  • the GPU is a microprocessor for image processing, and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos and the like.
  • the display screen 194 includes a display panel.
  • the electronic device 100 may include 1 or N display screens 194 , where N is a positive integer greater than 1.
  • the electronic device 100 can realize the shooting function through the ISP, the camera 193 , the video codec, the GPU, the display screen 194 and the application processor.
  • the ISP is used for processing the data fed back by the camera 193 .
  • Light is transmitted through the lens to the photosensitive element of the camera, where the optical signal is converted into an electrical signal; the photosensitive element of the camera transmits the electrical signal to the ISP for processing, which converts it into an image visible to the naked eye.
  • ISP can also optimize the algorithm for image noise and brightness.
  • ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be located in the camera 193 .
  • Camera 193 is used to capture still images or video.
  • the mobile phone 100 may include 1 or N cameras, where N is a positive integer greater than 1.
  • the camera 193 can be a front camera or a rear camera.
  • the NPU is a neural-network (NN) computing processor.
  • Applications such as intelligent cognition of the electronic device 100 can be realized through the NPU, such as image recognition, face recognition, speech recognition, text understanding, and the like.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, so as to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. Such as saving music, video and other files in the external memory card.
  • the internal memory 121 may be used to store one or more computer programs including instructions.
  • the processor 110 may execute the above-mentioned instructions stored in the internal memory 121, so that the electronic device 100 executes the voice interaction method provided in some embodiments of the present application, as well as various functional applications and data processing.
  • the internal memory 121 may include an area for storing programs and an area for storing data. Wherein, the stored program area can store an operating system; the stored program area can also store one or more application programs (such as voice recognition, gallery, contacts, etc.) and the like.
  • the storage data area can store data and the like created during use of the electronic device.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as one or more magnetic disk storage devices, flash memory devices, universal flash storage (universal flash storage, UFS) and the like.
  • The processor 110 causes the electronic device 100 to execute the voice interaction method provided in the embodiments of the present application, as well as various functional applications and data processing, by executing the instructions stored in the internal memory 121 and/or the instructions stored in the memory provided in the processor.
  • the electronic device 100 can implement audio functions through the audio module 170 , the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into analog audio signal output, and is also used to convert analog audio input into digital audio signal.
  • the audio module 170 may also be used to encode and decode audio signals.
  • the audio module 170 may be set in the processor 110 , or some functional modules of the audio module 170 may be set in the processor 110 .
  • The speaker 170A, also referred to as a "horn", is used to convert audio electrical signals into sound signals.
  • The electronic device 100 can listen to music or hands-free calls through the speaker 170A.
  • The receiver 170B, also called the "earpiece", is used to convert audio electrical signals into sound signals.
  • The receiver 170B can be placed close to the human ear to receive the voice.
  • The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. When making a phone call or sending a voice message, the user can put the mouth close to the microphone 170C to make a sound and input the sound signal into the microphone 170C.
  • the electronic device 100 may be provided with one or more microphones 170C. In some other embodiments, the electronic device 100 may be provided with two microphones 170C, which may also implement a noise reduction function in addition to collecting sound signals. In some other embodiments, the electronic device 100 can also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions, etc.
  • the earphone interface 170D is used for connecting wired earphones.
  • the sensor module 180 may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, etc. There are no restrictions on this.
  • the electronic device 100 provided in the embodiment of the present application may also include one or more components such as the button 190 , the motor 191 , the indicator 192 and the SIM card interface 195 , which is not limited in the embodiment of the present application.
  • the "voice assistant” involved in the embodiment of the present application can also be called “digital assistant”, “virtual assistant”, “intelligent automated assistant” or “automatic digital assistant”, etc.
  • a “voice assistant” can be understood as an information processing system that can recognize natural language input in the form of speech and/or text to infer user intent and perform corresponding actions based on the inferred user intent. The system may output responses to the user's input in audible (eg, voice) and/or visual form.
  • A user may ask a voice assistant a question such as "Where am I now?"; based on the user's current location, the voice assistant may answer "You are near the West Gate of Central Park." The user may also request that a task be performed, such as "Call Mike". In response, the voice assistant may confirm the request by saying "OK, right now", and then the voice assistant performs the task of dialing the phone of the contact "Mike".
  • voice assistants may provide responses in other visual or audio forms (eg, as text, prompts, music, video, animation, etc.). It can be understood that the user and the voice assistant can also perform other types of interactions, such as chatting, games, knowledge quizzes, etc., and the interaction form is not limited, which is not limited in the embodiment of the present application.
  • FIG. 2 is a functional architecture diagram of the voice assistant provided by an embodiment of the present application. Each functional module in the voice assistant is described below with reference to FIG. 2.
  • the front-end processing module 21 is used to process the voice instruction input by the user into a data format required by the subsequent algorithm, such as an audio feature vector, for use by the ASR module 22.
  • After the front-end processing module 21 obtains the voice instruction input by the user, it performs audio decoding on the voice instruction, decoding it into an audio signal in PCM format; it then uses voiceprint or other features to separate and denoise the audio signal, and performs feature extraction through audio processing algorithms such as framing, windowing, and short-time Fourier transform to obtain mel-frequency cepstral coefficient (MFCC) or filter bank audio feature vectors.
  • the front-end processing module 21 is generally disposed on the terminal side. It can be understood that the voice assistant may not include an independent front-end processing module 21 , for example, the functions of the front-end processing module 21 may be integrated in the voice recognition module 22 .
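  • As a rough illustration of this front-end processing (assuming the open-source librosa library; the frame sizes and the number of coefficients are arbitrary illustrative choices, not values from the embodiments), a decoded PCM signal could be turned into MFCC feature vectors as follows.

        import numpy as np
        import librosa  # assumed available; any MFCC implementation would do

        def extract_features(pcm: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
            """Convert a decoded PCM signal into the audio feature vectors consumed
            by the ASR module; framing, windowing, and the short-time Fourier
            transform happen inside the MFCC computation."""
            mfcc = librosa.feature.mfcc(
                y=pcm.astype(np.float32),
                sr=sample_rate,
                n_mfcc=13,        # number of cepstral coefficients (illustrative)
                n_fft=400,        # 25 ms frames at 16 kHz
                hop_length=160,   # 10 ms frame shift
            )
            return mfcc.T         # shape: (num_frames, n_mfcc)

        # One second of silence as a stand-in for a real voice command.
        print(extract_features(np.zeros(16000)).shape)
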
  • the speech recognition (automatic speech recognition, ASR) module 22 is used for obtaining the audio feature vector obtained by the front-end processing module 21, and converting the audio feature vector into text for the natural language understanding module 23 to understand.
  • The ASR module 22 is used for recognizing and outputting text recognition results: it uses one or more speech recognition models to process the audio feature vectors extracted by the front-end processing module 21, generating intermediate recognition results (for example, phonemes, phoneme strings, and subwords) and finally text recognition results (for example, words, word strings, or symbol sequences).
  • the one or more speech recognition models may include hidden Markov models, Gaussian mixture models, deep neural network models, n-gram language models or other statistical models.
  • the acoustic model is used to classify acoustic features to (decode) phonemes or words, and the language model is used to decode phonemes or words into a complete text.
  • The acoustic model and the language model process the audio feature vectors in series: the acoustic model converts the audio feature vectors into intermediate recognition results (for example, phonemes, phoneme strings, and subwords), and the language model then converts the phonemes or words into text recognition results (for example, words, word strings, or symbol sequences), outputting the text or symbol sequence corresponding to the user's voice instruction.
  • the natural language understanding (NLU) module 23 is used to perform semantic recognition on the text or symbol sequence corresponding to the user voice instruction to obtain semantic information. That is to convert the text or symbol sequence corresponding to the user's voice into structured information, where the structured information includes skills, machine-executable intention information, and recognizable slot information.
  • the purpose of the NLU module 23 is to obtain the semantic representation of the natural language input by the user through the analysis of syntax, semantics and pragmatics.
  • the NLU module 23 may perform skill classification, intent classification, and slot extraction on the text or symbol sequence corresponding to the user's voice.
  • a voice assistant can integrate multiple specific skills.
  • the voice assistant can maintain a skill list.
  • For example, the skill list includes skill A, skill B, ..., and skill N in FIG. 2.
  • Each skill corresponds to a type of service or function, such as a meal ordering service, a taxi service, querying the weather, and so on.
  • One or more intents can be configured under each skill.
  • the "weather query" skill can be configured with: Q&A intent "check the weather”.
  • One or more slots can be configured under each intent.
  • the question-and-answer intent "check the weather” can be configured with time slots and city slots.
  • a skill can be a service or a function, such as weather query service, flight ticket booking service, and so on.
  • Skills can be configured by developers of third-party applications (such as "weather") or third-party platforms.
  • One or more intents can be configured under a skill.
  • An intent can be a more fine-grained service or function under a skill. Intents can be divided into dialogue intents and question-and-answer intents. A dialogue intent should be used when parameters need to be collected; for example, the intent of booking a train ticket requires parameters such as the train number and the departure time, so a dialogue intent should be used.
  • A question-and-answer intent is better suited to frequently asked questions (FAQ), for example, "How is the refund fee charged?". One or more slots can be configured in an intent.
  • the slot is the key information used to express the user's intention in the user sentence.
  • For example, if the user's intent is the dialogue intent "check the weather", the slots that the NLU module 23 needs to extract from the user's voice command are the city slot and the time slot, where the city slot indicates "where" the weather is queried for and the time slot indicates "what day" the weather is queried for, as in the sketch below.
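  • To make the structured output concrete, the following toy Python sketch (rule-based and purely illustrative; a real NLU module 23 would use trained classification and slot-filling models, and the city and time lists are assumptions) produces the skill, intent, city slot, and time slot for a weather query.

        from dataclasses import dataclass, field
        from typing import Dict

        @dataclass
        class SemanticFrame:
            """Structured information from the NLU stage: skill, intent, and slots."""
            skill: str
            intent: str
            slots: Dict[str, str] = field(default_factory=dict)

        CITIES = ("Beijing", "Shanghai", "Shenzhen")
        TIMES = ("today", "tomorrow")

        def understand(text: str) -> SemanticFrame:
            frame = SemanticFrame(skill="weather query", intent="check the weather")
            lowered = text.lower()
            for city in CITIES:
                if city.lower() in lowered:
                    frame.slots["city"] = city    # the "where" of the weather
            for when in TIMES:
                if when in lowered:
                    frame.slots["time"] = when    # the "what day" of the weather
            return frame

        print(understand("What will the weather be like in Shenzhen tomorrow?"))
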
  • The dialogue management (DM) module 24 is used to output the next action according to the semantic information output by the NLU module 23 and the dialogue state, for example, determining which service/platform should be accessed, which feedback operation to take, or which response information to reply with.
  • the DM module 24 can be used to maintain and update the dialogue state, and can be used to determine the next action according to the dialogue state and semantic information.
  • the DM module 24 may consist of multiple sub-modules.
  • For example, the DM module 24 obtains the task corresponding to the voice instruction according to the semantics output by the NLU module 23 and then connects to the business platform 27 to complete the task; or the DM module 24 requires the user to further input more information; or the DM module 24 obtains the information requested by the voice command and returns it to the user.
  • Different skills output by the DM module 24 can be connected to different business platforms 27. For example, if the semantic information is listening to songs, a music playback platform can be connected; if the semantic information is watching videos, a video playback platform can be connected (a toy dispatch example is sketched below).
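  • The following toy dialogue-management step is a sketch under assumed names (the platform names, the required-slot table, and the action format are illustrative): if a required slot is missing, the next action is to ask the user; otherwise the task is dispatched to the matching business platform.

        SKILL_TO_PLATFORM = {
            "music": "music playback platform",
            "video": "video playback platform",
            "weather query": "weather service platform",
        }

        REQUIRED_SLOTS = {"weather query": ["city", "time"]}

        def decide_next_action(frame: dict) -> dict:
            """Return the next action: ask for a missing slot or dispatch the task."""
            missing = [s for s in REQUIRED_SLOTS.get(frame["skill"], [])
                       if s not in frame["slots"]]
            if missing:
                return {"action": "ask_user", "prompt": f"Please tell me the {missing[0]}."}
            return {"action": "dispatch",
                    "platform": SKILL_TO_PLATFORM.get(frame["skill"], "default platform"),
                    "intent": frame["intent"], "slots": frame["slots"]}

        print(decide_next_action({"skill": "weather query",
                                  "intent": "check the weather",
                                  "slots": {"city": "Shenzhen"}}))
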
  • The natural language generation (NLG) module 25 is used to textualize the system action output by the DM module 24, obtaining natural language text that is provided to the TTS module 26.
  • the speech synthesis (Text-to-Speech, TTS) module 26 is used for further converting the natural language text generated by the NLG module 25 into a playable response speech output signal.
  • the electronic device may perform corresponding operations according to the instructions output by the DM module 24 .
  • When the instruction output by the DM module 24 is an instruction for instructing voice output, the NLG module 25 can generate voice information according to that instruction, and the TTS module 26 outputs the voice.
  • For example, if the voice information input by the user is "play a song", the DM module 24 outputs an instruction for instructing voice output, the NLG module 25 generates the output text "What song do you want to play?" according to that instruction, and the TTS module 26 converts the text into voice, which is played by the electronic device.
  • the electronic device performs corresponding operations in response to the instruction.
  • the output of the DM module 24 may be embodied as an execution instruction, and the execution instruction is used to indicate the next action.
  • For example, if the voice information input by the user is "play song A", the DM module 24 outputs an execution instruction to play song A, and the electronic device automatically plays song A in response to the execution instruction.
  • The following uses the example of controlling a device to turn on a light through the voice assistant to describe the processing flow of the voice assistant.
  • the voice assistant may be an application, or a service, or a functional module (such as an API interface) integrated in other applications or services, which is not limited in this embodiment of the present invention.
  • An electronic device equipped with a voice assistant receives a voice command input by the user (such as "turn on the light"). The voice assistant invokes the ASR module 22, the NLU module 23, and the DM module 24 to identify the intent corresponding to the user's voice command and map it to the corresponding skill (such as the turn-on-the-light skill). According to the skill mapping result, the voice assistant sends a skill execution request to the corresponding business logic processing system (such as a control platform) through the corresponding skill service interface. The business logic processing system executes the request, controls the corresponding device/platform (such as a light) to perform the corresponding service (such as turning on the light), and the electronic device provides service feedback to the user (such as the voice broadcast "the light is turned on").
  • the voice assistant can also directly control the switch of the light without going through a business logic processing system (such as a control platform). This embodiment of the present invention does not specifically limit it.
  • all functional modules of the voice assistant can be deployed on the electronic device.
  • This type of electronic device can include intelligent robots, or rich devices with rich functions such as mobile phones, car machines, and large screens.
  • a part of the functional modules of the voice assistant can be deployed on the electronic device, and a part can be deployed on the server or other devices, for example, the front-end processing module 21 can be deployed on the electronic device.
  • the ASR module may be deployed on the electronic device, or a part of the ASR module may be deployed on the electronic device, and a part of the ASR module may be deployed on a server or other devices.
  • the deployment of the NLU and DM modules may also be similar to the foregoing deployment of the ASR module, which is not specifically limited in this embodiment of the present invention.
  • electronic devices such as mobile phones, car machines, and large screens mentioned above can also be deployed using this type of architecture, and some other thin devices can also be deployed using this type of architecture.
  • voice assistants can be distributed across multiple electronic devices, and cooperate to implement voice interaction functions.
  • the voice assistant may have more or fewer components than shown, may combine two or more components, or may have a different configuration or layout of components.
  • The various functional modules shown in FIG. 2 may be implemented in hardware, in software instructions executed by one or more processors, in firmware including one or more signal processing integrated circuits and/or application-specific integrated circuits, or in a combination thereof.
  • FIG. 3 is a schematic flowchart of a voice interaction method provided by an embodiment of the present application.
  • the voice interaction method includes the following steps:
  • Step S301: the user activates the voice assistant on the mobile phone.
  • When the user wants to interact with the mobile phone through voice, the user can first trigger the voice interaction function in the mobile phone, for example by starting the voice assistant in the mobile phone so that the voice assistant is in a working state.
  • Step S301 may be omitted; for example, the voice interaction function (such as the voice assistant) may not need to be activated first, and the user may directly perform voice interaction with the voice assistant.
  • Starting the voice assistant can include but not limited to the following methods:
  • Method 1: the user can start (wake up) the voice assistant by voice.
  • the voice data for waking up the voice assistant may be called a wake-up word (or wake-up voice).
  • the wake-up word can be pre-registered in the mobile phone.
  • the wake-up word of Huawei voice assistant Xiaoyi is "Xiaoyi, Xiaoyi”.
  • The mobile phone equipped with the voice assistant Xiaoyi can set the microphone to always-on; the mobile phone can then detect the voice signal input by the user in real time through the microphone.
  • When the voice signal of the user inputting the wake-up word "Xiaoyi, Xiaoyi" is detected, the mobile phone can wake up the voice assistant Xiaoyi installed in the mobile phone so that it can receive the user's voice commands. After being awakened, the voice assistant Xiaoyi can respond to the wake-up word "Xiaoyi, Xiaoyi" input by the user, output the response "Xiaoyi is here", and start receiving voice commands input by the user. As shown in FIG. 4, a dialogue interface 501 of the voice assistant can be displayed on the mobile phone, and the dialogue content between the user and the voice assistant Xiaoyi can be displayed in the dialogue interface 501 in real time.
  • Method 2: the user can start the voice assistant by touch, for example by pressing and holding the home button, or by clicking the power button or the application icon of the voice assistant on the mobile phone interface.
  • Step S302: the user sends a first voice command to the mobile phone.
  • the user sends a voice command 1 to the mobile phone.
  • If the voice assistant on the mobile phone can accurately recognize the semantics of the voice command 1, the mobile phone performs the corresponding operation and/or controls other devices to perform the corresponding operation.
  • If the voice assistant cannot recognize the semantics of the voice command 1, that is, the user intention corresponding to the voice command 1, the mobile phone cannot perform the corresponding operation.
  • In this case, the voice assistant can give the user a reminder, for example, the mobile phone prompts the user by voice: "I don't know what you mean".
  • If the voice assistant wrongly recognizes the semantics of the voice command 1, for example identifying the real user intent A corresponding to the voice command 1 as intent B, the voice assistant outputs an execution instruction C according to intent B, and the mobile phone performs an operation in response to the execution instruction C. From the operation performed by the mobile phone, the user can tell that the voice assistant misrecognized the semantics of the voice command 1.
  • The first voice command may be any voice command that the user inputs to interact with the mobile phone at any time during the voice interaction process.
  • In some scenarios, the voice assistant may not be able to effectively recognize the semantics of the voice command 1, for example:
  • Scenario 1: the content of the user's voice command is colloquial or personalized.
  • For example, users may not use written sentences or standard voice commands such as "play music"; the voice commands input by users may instead be expressions such as "Let's sing a song" or "Music walk up".
  • Scenario 2: the keywords/objects of the user's voice command are unclear.
  • For example, users may not use complete or standardized keyword descriptions in voice commands.
  • For instance, the voice assistant can recognize the standard voice command "I want to watch 《Harry Potter 2》" (with the formal title), while users may prefer to use abbreviated or popular keyword descriptions, such as "I want to watch Harry Potter 2".
  • Scenario 3: the user's voice command is vague or ambiguous.
  • For example, the voice command sent by the user may have an unclear meaning, such as "I want to watch a movie where actor A plays a chef", while the standard voice command is actually "I want to watch 《Movie B》" (a film starring actor A).
  • In another example, the voice command sent by the user carries a dialect or accent, such as "call the old man" (Sichuan dialect), while the standard voice command that the voice assistant can recognize is "call dad" (Mandarin).
• The situations in which the voice assistant cannot effectively recognize the semantics of the voice command 1 are not limited to the above examples.
• In these situations, the voice assistant may fail to recognize the semantics of the voice command 1 sent by the user, that is, it cannot recognize the user intention corresponding to the voice command 1. In other words, the set of voice commands that the voice assistant can recognize does not cover the voice commands in the above situations.
• The voice commands in the above situations are therefore voice commands that the voice assistant cannot effectively recognize.
• That is, the set of voice commands that the voice assistant can effectively recognize does not cover the non-standard sentence patterns, non-standard keywords or ambiguous commands in the above examples; the ASR module and/or the NLU module of the voice assistant cannot effectively recognize these non-standard voice commands.
• Deep learning models such as the recurrent neural network (RNN), the long short-term memory (LSTM) network, and the Transformer can be applied to the ASR module and the NLU module of the voice assistant shown in FIG. 3.
• When these models are trained, preset standard voice commands are usually used, such as "play music" and "please turn on the light". That is, the ASR module and the NLU module of the voice assistant have a set of voice commands that their internal speech recognition algorithms or models can support after training.
• The commands in this voice command set, which the voice assistant can effectively recognize, may be called standard voice commands. That is, a standard voice command is a voice command whose corresponding user intention can be directly and effectively recognized by the voice assistant.
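• As an illustrative, non-limiting sketch (not part of the claimed implementation), the "standard voice command set" can be pictured as the set of utterances the trained model maps to intents, with anything outside that set producing no recognition result. The command strings and intent names below are hypothetical.

```python
# Minimal sketch: the "standard voice command set" as the utterances the trained
# speech/NLU model can map to intents. Command strings and intent names are hypothetical.
STANDARD_COMMANDS = {
    "play music": "PlayMusic",
    "please turn on the light": "TurnOnLight",
    "i want to watch harry potter 2": "PlayVideo",
}

def recognize_intent(text: str):
    """Return the intent of a standard voice command, or None if it is not supported."""
    return STANDARD_COMMANDS.get(text.strip().lower())

print(recognize_intent("Play music"))      # 'PlayMusic'  -> standard command
print(recognize_intent("Music walk up"))   # None         -> unrecognizable feedback
```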
• Therefore, the voice assistant may not be able to effectively recognize the semantics of the first voice command.
• The first voice command is not limited to the above three situations, and this embodiment of the present application does not specifically limit it.
• In the following, the first voice command "Biao ge ge ba" (a colloquial expression meaning "let's sing a song") is taken as an example.
  • the voice assistant may display text content corresponding to the first voice instruction on the dialogue interface 601 .
• In addition, step S301 and step S302 may be combined into one step, in which the voice signal input by the user is a voice signal beginning with the voice wake-up word.
• For example: "Xiaoyi, Xiaoyi, let's sing a song", "Xiaoyi, Xiaoyi, please share the screen in the conference room", or "Xiaoyi, Xiaoyi, I want to end the meeting".
• The voice signal after the wake-up word is the voice command input by the user; for example, "Let's sing a song", "Please share the screen in the conference room" and "I want to end the meeting" are voice commands sent by the user to the voice assistant.
• After the voice assistant detects the wake-up word, it receives the voice command, and the dialogue interface of the voice assistant displayed on the mobile phone shows the text content of "Biao ge ge ba".
• Step S303: the mobile phone recognizes the first user intention corresponding to the first voice command.
  • the mobile phone may use the voice interaction function to identify the first user intent corresponding to the first voice command, for example, the recognition of the first user intent corresponding to the first voice command may be completed by a voice assistant.
• Specifically, the microphone on the mobile phone forwards the collected voice signal of the user (the first voice command) to the front-end processing module 21 of the voice assistant, and the front-end processing module 21 preprocesses the voice signal to obtain a preprocessed voice signal.
• The preprocessed voice signal is input to the ASR module 22.
• The ASR module 22 converts the preprocessed voice signal into corresponding text to obtain a first text.
• The first text may also be the text obtained after the voice assistant performs text processing on the converted text, such as text normalization, error correction, and written-form conversion.
  • the first text is input to the NLU module 23 .
  • the NLU module 23 recognizes the semantics of the first text, performs word segmentation, part-of-speech tagging, keyword extraction and other processing operations on the first text, and extracts the first user intention corresponding to the first voice instruction.
  • the specific implementation manner thereof may refer to FIG. 2 , which will not be repeated here.
• However, the voice assistant may not be able to recognize the first user intention corresponding to the first voice command. For example, the first user intention corresponding to the first voice command "Biao ge ge ba" is "play music", but the voice assistant may fail to recognize it, or may not correctly recognize that the first user intention corresponding to the first voice command is "play music".
• For another example, the first user intention corresponding to a first voice command that uses an abbreviated title of "Harry Potter 2" is "open "Harry Potter 2"", but the voice assistant may recognize it as another video.
• Step S304: when the mobile phone cannot recognize the first user intention corresponding to the first voice command, the mobile phone outputs unrecognizable feedback.
• If the ASR module 22 of the voice assistant fails to recognize the first voice command, or the NLU module 23 fails to recognize the first voice command, the voice assistant cannot recognize the semantics of the first voice command, that is, the voice assistant cannot understand the first user intention corresponding to the first voice command.
• In this case, the voice assistant outputs unrecognizable feedback to the user through the mobile phone, so as to convey to the user that the voice assistant cannot understand or recognize the first user intention corresponding to the first voice command.
• The unrecognizable feedback can be displayed on the dialogue interface in text form, for example as the text content "I don't know what you mean".
• Alternatively, the mobile phone may output the unrecognizable feedback to the user in voice form, such as playing the voice "I don't know what you mean".
• The unrecognizable feedback may also be "I don't understand you" or "Xiaoyi can't understand you", etc., which is not specifically limited in this embodiment of the present application.
• Alternatively, step S304 may be: when the mobile phone cannot recognize the first user intention corresponding to the first voice command, the mobile phone outputs guidance information for guiding the user to continue the subsequent interaction.
• That is, when the voice assistant cannot recognize the semantics corresponding to the first voice command, the voice assistant can also generate guidance information for guiding the user to continue the subsequent interaction, and the mobile phone outputs the guidance information to the user.
• The guidance information can be information requesting the user to continue the interaction, such as "please speak again", "please speak slowly" or "please speak again in Mandarin", or it can be a question about the voice command sent by the user, such as "what did you just mean" or "what did you just say".
• This embodiment of the present application does not specifically limit it.
• The guidance information may also be displayed on the dialogue interface in text form.
• For example, the mobile phone displays the text content of the guidance information "Please say it again" on the dialogue interface 702.
• Alternatively, the mobile phone may output the guidance information to the user in voice form, for example by outputting the voice "please say it again" to the user.
• Alternatively, step S304 may be: when the mobile phone cannot recognize the first user intention corresponding to the first voice command, the mobile phone outputs both unrecognizable feedback and guidance information for guiding the user to continue the subsequent interaction.
• The unrecognizable feedback and the guidance information may be output in the above text form or voice form; for example, both may be output in text form, or both in voice form, or each in a different form, which is not specifically limited in the embodiments of the present application.
• For example, the mobile phone displays on the dialogue interface 703 the text content "I don't know what you mean, please say it again", which contains both the unrecognizable feedback and the guidance information.
• The mobile phone can also output the unrecognizable feedback or the guidance information in other forms, such as vibration (possibly at different vibration frequencies) or indicator lights.
• Different electronic devices can also use different indication forms; for example, a smart speaker can indicate by controlling its LED lights or by the blinking frequency of the lights. This embodiment of the present application does not specifically limit it.
• Alternatively, step S304 may be: the mobile phone performs an operation according to the user intention it has recognized for the first voice command.
• That is, if the first user intention recognized by the mobile phone in step S303 is wrong, in other words the mobile phone cannot correctly recognize the first user intention corresponding to the first voice command and identifies it as some other, wrong intention, then in step S304 the mobile phone performs a wrong operation according to the wrong intention.
• For example, the user inputs the first voice command "Biao ge ge ba", and the voice assistant recognizes the user intention of the first voice command as "turn on the light".
• The dialogue interface 704 displays the feedback after the voice assistant performs the operation, such as "OK, the light has been turned on". From this feedback, the user can learn that the voice assistant has misrecognized his or her true intention.
• Step S305: the user sends a second voice command to the mobile phone.
• If a voice assistant in the prior art cannot recognize the first voice command output by the user, it ends the processing of the first voice command and terminates the interaction process.
• That is, when the voice interaction function (voice assistant) of the mobile phone cannot recognize the user's voice command, the conversation ends.
• Moreover, if a voice assistant in the prior art misrecognizes the user intention corresponding to the first voice command, the user has no way to correct the voice assistant's recognition of the real user intention of the first voice command.
• In contrast, the voice assistant that executes the voice interaction method provided by the embodiment of the present application can provide an explanation function: when the voice assistant cannot recognize, or cannot correctly recognize, the first user intention corresponding to the first voice command output by the user, the voice assistant continues to use the sound pickup device (such as a microphone) to collect the user's voice, so as to receive a second voice command input by the user for explaining or repeating the first voice command.
• In other words, the voice assistant that executes the voice interaction method provided by the embodiment of the present application provides an explanation process: the user can continue to interact with the voice assistant by voice, expressing the content of the previous first voice command to the voice assistant again in another way.
• That is, the user expresses or explains the semantics of the last first voice command to the voice assistant in another way, so that the voice assistant can understand the user intention corresponding to the first voice command. The voice assistant can then learn the user intention corresponding to the first voice command from the voice command newly input by the user, or correct its recognition result of the user intention corresponding to the first voice command.
• In this way, when the voice assistant cannot recognize the first voice command output by the user, the user can send the second voice command to interpret the first voice command, so that the voice assistant can effectively execute the response corresponding to the first voice command, which enriches the voice interaction between the user and the electronic device.
• For example, the user inputs a second voice command "play music", where the second voice command "play music" is a voice command that the voice assistant can effectively recognize.
• In some embodiments, when the voice assistant cannot recognize the first voice command, that is, the voice assistant cannot generate a recognition result (such as a user intention) for the first voice command, the voice assistant can establish a learning session associated with the first voice command, so that the second voice command input by the user can be received during the learning session.
• Alternatively, after the voice assistant generates a recognition result for the first voice command, the voice assistant can also establish a learning session associated with the first voice command, so that it can continue to receive voice commands input by the user during the learning session.
• Alternatively, the voice assistant can continue to receive the second voice command input by the user and detect that the second voice command is used to explain the first voice command; if the second voice command is sent in the second or third way described below, the voice assistant can choose whether or not to establish a learning session associated with the first voice command.
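• The "learning session" described above could be realized, for example, by keeping the unrecognized first voice command as pending state for a limited time and treating commands received while the session is open as candidate explanations. The sketch below is only an assumption about one possible realization; the timeout value and function names are hypothetical.

```python
import time

class LearningSession:
    """Sketch: a session opened when a first voice command cannot be recognized."""
    def __init__(self, first_command: str, timeout_s: float = 30.0):
        self.first_command = first_command          # the unrecognized first voice command
        self.opened_at = time.monotonic()
        self.timeout_s = timeout_s

    def is_open(self) -> bool:
        return time.monotonic() - self.opened_at < self.timeout_s

session = None

def on_unrecognized(first_command: str):
    global session
    session = LearningSession(first_command)        # associate the session with the command

def on_next_voice_command(text: str):
    # While the session is open, the newly received command is treated as a
    # candidate explanation or restatement of the unrecognized first command.
    if session is not None and session.is_open():
        return ("explanation_of", session.first_command, text)
    return ("normal_command", None, text)
```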
• The ways for the user to send the second voice command may include the following:
• In the first way, the user directly sends the second voice command to the mobile phone.
• The first way is used in a scenario where the voice assistant or the mobile phone cannot recognize the first user intention corresponding to the first voice command. That is, when the voice assistant cannot recognize the semantics of the first voice command, the mobile phone receives a second voice command used for explaining or repeating the first voice command.
• For example, the next voice command received by the voice assistant from the user is used, by default, to explain or repeat the first voice command; or, when the voice assistant cannot recognize the first voice command, a voice command received within a preset time is used as the second voice command for explaining or repeating the first voice command; and so on.
• For example, in step S303 the voice assistant cannot recognize the first user intention corresponding to the first voice command; in step S304, after the voice assistant outputs the unrecognizable feedback and/or the guidance information through the mobile phone, the user understands that the voice assistant cannot recognize the user intention corresponding to the first voice command, and the user continues to send a second voice command such as "play music" to the mobile phone.
• The mobile phone forwards the received second voice command to the voice assistant, and the voice assistant by default treats the second voice command "play music" as a restatement of, or an explanation of, the previously unrecognizable first voice command "Biao ge ge ba".
• In the second way, the user can use a preset template to send the second voice command to the mobile phone, that is, the voice input by the user includes the preset template and the second voice command.
• The preset template is used to indicate that the second voice command currently sent with the preset template is used for restating or explaining the last first voice command.
• When the voice assistant detects the preset template, it considers the second voice command sent with the preset template to be a voice command for explaining or repeating the last first voice command.
• The preset template may take the form of an instruction that includes explanatory content, and this embodiment of the present application does not specifically limit the format of the preset template.
• For example, the preset template may be a fixed sentence pattern, such as "I mean", "what I meant in the last sentence is" or "what I just said is". That is, when the user uses the second way to send the second voice command to the mobile phone, the input voice is "I mean play music", "What I meant in the last sentence is play music" or "What I just said is play music".
• The preset template may also be a preset word, such as "explain", "restate" or "correct". That is, when the user sends the second voice command to the mobile phone, the input voice is "Explain, play music", "Restate, play music" or "Correct, play music", etc. A minimal sketch of how such a preset template can be detected is given after the description of the third way below.
• In the third way, the user may send the second voice command to the mobile phone while or after performing a trigger operation.
• The trigger instruction corresponding to the trigger operation is used to indicate that the second voice command received by the mobile phone is for repeating or explaining the last first voice command.
• The trigger operation can be a trigger operation on a UI virtual button, and the UI virtual button can be displayed on the dialogue interface of the voice assistant.
• After the trigger operation, the user can continue to input a voice command to the mobile phone for restating or explaining the last voice command. Taking the UI virtual button as an example, the user taps the UI virtual button displayed on the mobile phone and at the same time inputs the voice command "play music".
• The voice assistant detects the trigger operation of the user tapping the UI virtual button on the dialogue interface, and after receiving the trigger instruction corresponding to the trigger operation, the voice assistant treats the voice command "play music" received by the mobile phone as a restatement of, or an explanation of, the first voice command "Biao ge ge ba".
• The trigger operation may also be a trigger operation on a physical button, and the physical button may be a home button, a power button, an in-vehicle voice button, or a button on the remote control of a smart screen.
  • the triggering operation may also be a preset gesture or the like. This embodiment of the present application does not specifically limit it.
• In the first way described above, when the voice assistant cannot recognize the first voice command, the voice assistant by default treats the voice command received after the first voice command as being used for explaining or repeating the first voice command.
• The user can thus explain directly to the voice assistant, without using a preset template to send the second voice command as in the second way, and without performing a trigger operation as in the third way, which saves the user's operations and makes the interaction between the user and the mobile phone smarter and more natural.
• The above second way can be used when the voice assistant cannot recognize the first user intention corresponding to the first voice command, that is, when the voice assistant cannot recognize and output the first user intention corresponding to the first voice command, or when the voice assistant cannot correctly recognize the first user intention corresponding to the first voice command.
• For example, the mobile phone outputs unrecognizable feedback, and the user thereby learns that the voice assistant cannot recognize and output the first user intention corresponding to the first voice command.
• The user then sends the second voice command to the mobile phone using a preset template, for example by inputting the voice "I mean play music" to the mobile phone.
• The voice assistant detects that the voice input includes the preset template "I mean", and therefore treats the second voice command "play music" as a restatement of, or an explanation of, the first voice command "Biao ge ge ba".
• For another example, the voice assistant misrecognizes the user intention of the first voice command "Biao ge ge ba", which is actually "play music", as "turn on the light".
• The user then uses a preset template to send the second voice command to the mobile phone, for example by inputting the voice "I mean play music".
• The voice assistant detects that the voice input includes the preset template "I mean", recognizes that the currently input voice is used to explain the previous first voice command "Biao ge ge ba", and treats the second voice command "play music" as an explanation of the first voice command "Biao ge ge ba". The voice assistant can then correct the user intention of "Biao ge ge ba" from "turn on the light" to "play music".
• In the second way, the user can explain or restate the first voice command according to his or her own needs and is not restricted by the guidance process of the voice assistant. That is, the second way does not assume that, when the voice assistant cannot recognize the first voice command, whatever voice command is received next is used for explaining or repeating the first voice command.
• The user can input a voice command that has nothing to do with the first voice command, and the voice assistant will not treat that irrelevant voice command as an explanation of the first voice command; only when the voice assistant detects that the user has used a preset template is the voice command sent with the preset template considered a restatement or explanation of the last unrecognized voice command.
• The second way can also be used to correct the voice assistant's recognition when the user learns that the voice assistant has recognized a command incorrectly.
• In addition, the second way completes the explanation or restatement of the first voice command entirely through voice interaction, thereby improving the user experience.
• The above third way is similar to the second way, and can be used when the voice assistant cannot recognize and output the first user intention corresponding to the first voice command, or when the voice assistant cannot correctly recognize the first user intention corresponding to the first voice command.
• The difference between the third way and the second way is that the third way provides the user with an additional interaction experience: the user can perform a trigger operation on a physical button or a UI virtual button, or use a preset gesture, and send the second voice command during or after the trigger operation; after the voice assistant or the mobile phone detects the trigger instruction corresponding to the trigger operation, the second voice command is treated as a restatement or explanation of the previous first voice command.
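• A minimal sketch of the preset-template detection used in the second way is given below; it simply matches a template at the beginning of the recognized text and takes the remainder as the second voice command. The template strings are the examples mentioned above, and the matching-by-prefix strategy is an assumption, not the only possible implementation.

```python
# Sketch of the second way: detect a preset template in the recognized text and
# extract the rest of the text as the second voice command explaining the first one.
PRESET_TEMPLATES = ("i mean", "what i just said is", "explain,", "restate,", "correct,")

def extract_explanation(text: str):
    """Return the second voice command if a preset template is present, else None."""
    lowered = text.strip().lower()
    for template in PRESET_TEMPLATES:
        if lowered.startswith(template):
            return lowered[len(template):].strip(" ,")
    return None

print(extract_explanation("I mean play music"))   # 'play music'
print(extract_explanation("play music"))          # None (no preset template used)
```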
• This embodiment of the present application does not limit whether a single round of voice interaction or multiple rounds of voice interaction are used.
• When the voice assistant uses the voice interaction method of this embodiment of the present application, the voice assistant continues to use the sound pickup device (such as a microphone) to collect the user's voice, so that the single-round interaction is converted into multiple rounds of interaction.
• In addition, the voice command 2 that the user continues to input may itself not be a standard voice command, and the voice assistant may still fail to recognize it or to recognize it correctly. For this case, a preset threshold on the number of interactions or on the interaction duration can be set: if the voice assistant still cannot recognize the voice command 2 that the user continues to input within the preset number of interactions or duration, the voice assistant uploads the data involved (including but not limited to voice command data, program logs, etc.) to a cloud server for manual identification, and the voice command 2 is manually associated with the first voice command.
• If, within a preset time range, a voice command input by the user is received, or a voice command sent by the user using a preset template is received, or a trigger instruction is detected, the following steps continue; if the preset time range is exceeded without receiving a voice command input by the user, without receiving a voice command sent using a preset template, and without detecting a trigger instruction, the interaction process ends.
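• The fallback described above (a preset limit on explanation rounds, with escalation to manual identification) might be sketched as follows. The threshold value and the upload call are placeholders and are only assumptions for illustration.

```python
MAX_EXPLANATION_ROUNDS = 3   # hypothetical preset threshold on the number of interactions

def explanation_loop(first_command, get_next_command, recognize_intent, upload_for_manual_review):
    """Try up to MAX_EXPLANATION_ROUNDS explanation commands; escalate if all of them fail."""
    for _ in range(MAX_EXPLANATION_ROUNDS):
        second_command = get_next_command()
        if second_command is None:          # the user stopped interacting
            return None
        intent = recognize_intent(second_command)
        if intent is not None:              # the explanation was understood
            return intent
    # None of the explanations could be recognized: hand the involved data over
    # to the cloud side for manual association with the first command (placeholder).
    upload_for_manual_review(first_command)
    return None
```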
• Step S306: the mobile phone recognizes the second voice command and performs a first operation corresponding to the second voice command.
• Specifically, the voice assistant on the mobile phone can recognize the second user intention corresponding to the second voice command, and perform the first operation corresponding to the second user intention.
• The mobile phone may use the voice interaction function to recognize the second user intention corresponding to the second voice command; for example, the voice assistant recognizes the second user intention corresponding to the second voice command.
• If the voice input received by the mobile phone includes the second voice command such as "play music" and a preset template such as "I mean", the voice assistant recognizes the preset template "I mean" in the voice input, takes the content of the voice input other than the preset template "I mean" as the second voice command, and thus obtains the second voice command "play music". For a specific implementation of recognizing the second user intention corresponding to the second voice command "play music", reference may be made to step S303, which is not repeated here.
• For example, the user uses the first way to send the second voice command, that is, the user directly inputs the second voice command "play music" to the mobile phone; the voice assistant receives the second voice command "play music" and, as shown in FIG. 7(a), the text content of the second voice command is displayed on the dialogue interface 801.
• For another example, a UI virtual button 804 is displayed on the dialogue interface 803 presented on the mobile phone, and the user taps the UI virtual button 804 and inputs the second voice command "play music".
• The mobile phone displays the text content of the second voice command on the dialogue interface 803.
• The mobile phone recognizes that the second user intention corresponding to the second voice command "play music" is "play music" (for example, Intent: Play Music).
• The first operation performed by the mobile phone is a music playback operation, such as opening a music playback App or a music playback service and playing a song for the user.
• The mobile phone may determine a recommended song according to a preset recommendation rule and then play it for the user. For example, based on the user's historical playback records, the mobile phone can take the most-played song in the last 7 days as the recommended song. In response to the execution instruction for playing music, the mobile phone automatically plays the determined recommended song and displays it on the dialogue interface.
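• The "most played in the last 7 days" recommendation rule mentioned above could, for example, be computed as below; the structure of the playback records is an assumption.

```python
from collections import Counter
from datetime import datetime, timedelta

def recommend_song(play_records):
    """play_records: list of (song_title, played_at datetime) tuples.
    Returns the song played most often in the last 7 days, or None without history."""
    cutoff = datetime.now() - timedelta(days=7)
    recent = [title for title, played_at in play_records if played_at >= cutoff]
    if not recent:
        return None
    return Counter(recent).most_common(1)[0][0]
```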
• After the voice assistant recognizes the second user intention of the second voice command, the song is played on the mobile phone. As shown in FIG. 8, the mobile phone outputs feedback, such as broadcasting "OK, start playing music", and the text content of the response to the user's voice command and a music control 902 are displayed on the dialogue interface 901.
• The song shown as playing in the music control 902 is the song currently being played by the mobile phone.
• Step S307: the mobile phone establishes an association relationship between the first voice command and the second voice command.
  • the association relationship between the first voice command and the second voice command may be stored locally on the mobile phone, or may be stored in a cloud server, which is not specifically limited in the embodiment of the present application.
  • the embodiment of the present application does not specifically limit the form of the association relationship.
  • step S307 can be completed by the voice assistant on the mobile phone.
• Specifically, the voice assistant detects the second voice command used for restating or explaining the unrecognized first voice command, and establishes an association relationship between the first voice command and the second voice command. In other words, when the voice assistant detects that the user has sent the second voice command in any of the above three ways, it establishes an association relationship between the first voice command and the second voice command.
  • step S307 is executed after the user executes step S305.
  • the voice assistant may execute step S307 at any time. That is, step S307 can be performed before, after or simultaneously with step S306, and step S307 can also be performed after the end of the current voice interaction process, that is, the voice assistant is allowed to perform step S307 offline.
• For example, the voice assistant may execute step S307 when the voice assistant is idle, or when the mobile phone is screen-off and charging.
• That the voice assistant detects the second voice command used to repeat or explain the unrecognized first voice command may include the following ways:
• The voice assistant assumes by default that, when it cannot recognize the first user intention corresponding to the first voice command, the voice command it receives next is the second voice command for repeating or explaining the unrecognized first voice command. That is, by default, after the mobile phone outputs to the user the unrecognizable feedback, the guidance information, or both, the voice command that the user outputs to the mobile phone is the second voice command for repeating or explaining the unrecognized first voice command.
• Or, when the voice input sent by the user includes a preset template, the voice input includes a second voice command for repeating or explaining the unrecognized first voice command. That is, during the voice interaction process, the voice assistant detects whether the voice input entered by the user contains a preset template; if so, the voice assistant detects whether the voice input includes a voice command other than the preset template, and that voice command is the second voice command for restating or explaining the unrecognized first voice command.
• Or, when a trigger instruction is detected, the voice command received by the mobile phone is the second voice command. That is, around the moment when the trigger instruction is generated (for example, when or after the trigger instruction is generated), the voice command received by the voice assistant is used as the second voice command.
• Establishing the association relationship between the first voice command and the second voice command can be understood as mapping the first voice command to the second voice command, that is, considering that the first user intention corresponding to the first voice command and the second user intention corresponding to the second voice command are similar or consistent.
• Therefore, the second user intention corresponding to the second voice command may be used as the first user intention corresponding to the first voice command.
• For example, establishing the association relationship between the first voice command and the second voice command may be equating the first voice command with the second voice command. After the first voice command is equated with the second voice command, when a third voice command with the same content or pronunciation as the first voice command is subsequently received, the third voice command is replaced by the second voice command, the second voice command is recognized and processed to obtain the second user intention, and the second user intention is output as the user intention corresponding to the third voice command.
• Alternatively, when the association relationship between the first voice command and the second voice command is established, the second user intention of the second voice command can be associated with the first voice command, that is, the second user intention can be used as the user intention of the first voice command. After the second user intention of the second voice command is associated with the first voice command, when a third voice command with the same content or pronunciation as the first voice command is subsequently received, there is no need to recognize the third voice command; the second user intention is obtained directly and output.
• In this way, an association table can be obtained.
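• As a non-limiting sketch, the association relationship could be stored either as a mapping from the first voice command to the second voice command, or as a mapping from the first voice command to the second user intention; the dictionary contents and the helper function below are assumptions for illustration only.

```python
# Form 1: non-standard first command -> equivalent standard second command.
command_table = {"biao ge ge ba": "play music"}

# Form 2: non-standard first command -> recognition result (user intention).
intent_table = {"biao ge ge ba": "PlayMusic"}

def resolve(text, recognize_intent):
    key = text.strip().lower()
    if key in intent_table:              # intent already associated: no further recognition needed
        return intent_table[key]
    if key in command_table:             # replace with the associated standard command first
        return recognize_intent(command_table[key])
    return recognize_intent(text)        # ordinary standard-command path
```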
• The voice assistant can expand its intention understanding capability according to the association relationship (or the association table), so that a voice assistant that previously could not recognize the first user intention corresponding to the first voice command becomes able to recognize it. That is, the user can communicate with the mobile phone using non-standard voice commands, such as commands with non-standard sentence patterns, non-standard keywords or ambiguity that are not covered by the standard voice command set of the ASR module and/or the NLU module of the voice assistant.
• The voice assistant or the mobile phone using the voice interaction method of this embodiment of the present application can, by outputting unrecognizable feedback or guidance information, guide the user to explain or restate the first voice command, receive the second voice command input by the user for restating or explaining the unrecognized first voice command, and establish the association relationship between the first voice command and the second voice command. According to the association relationship, the voice assistant learns by itself or updates its model or association table, thereby expanding its ability to understand the intention of voice commands with non-standard sentence patterns, non-standard keywords or ambiguity.
• In other words, the voice assistant provides an explanation function, and the user uses this explanation function to interact with it through standard voice commands, guiding the voice assistant to expand its intention understanding ability, so that the voice assistant can quickly support user-defined non-standard sentence patterns, non-standard keywords and/or ambiguous non-standard voice commands. This solves the problem that voice assistants cannot recognize non-standard, colloquial or personalized voice commands, and enriches the voice interaction between users and electronic devices.
• The following describes in detail how to improve the voice assistant's recognition of the first voice command based on the association relationship between the first voice command and the second voice command, so as to expand the voice assistant's intention understanding ability and support the above non-standard voice commands with non-standard sentence patterns, non-standard keywords or ambiguity.
• Step S308: the user sends a third voice command to the mobile phone.
• After the voice assistant establishes the association relationship between the first voice command and the second voice command and thereby expands its recognition capability, the voice assistant can effectively recognize the first voice command.
• The user can then interact with the mobile phone using a third voice command whose voice content is the same as or similar to that of the first voice command, such as "Biao ge ge ba".
• That is, the user sends the voice command "Biao ge ge ba" to the mobile phone for the Nth time, where N is an integer greater than 1.
• In some embodiments, the third voice command has the same voice content as the first voice command. For example, when the user sends the third voice command, the user inputs the voice "Biao ge ge ba" to the mobile phone; regardless of whether the ASR module first transcribes the voice exactly as the text "Biao ge ge ba" or as a similar-sounding text, the error correction function in the ASR module corrects the final output text to "Biao ge ge ba".
• In other embodiments, the third voice command is a voice command whose voice content is similar to that of the first voice command.
• The voice assistant regards the third voice command as a voice command whose voice recognition result is strongly related to that of the first voice command, and then associates the third voice command with the first voice command; thus the user intention corresponding to the third voice command can be considered to be the first user intention corresponding to the first voice command.
• Step S309: the mobile phone performs the first operation according to the third voice command.
• When the voice assistant recognizes that the voice recognition result of the third voice command is the same as that of the first voice command, for example the recognized content or pronunciation of the third voice command is the same as that of the first voice command, then based on the association relationship between the first voice command and the second voice command, the mobile phone performs the same first operation as the response to the second voice command.
• For specific content, reference may be made to step S306, which is not repeated here.
• In addition, when the NLU module recognizes the user intention corresponding to a voice command, the NLU module is designed to have a certain robustness or error correction capability: even if the text transmitted by the ASR module and received by the NLU module differs slightly from the standard text, the NLU module can still correctly identify the corresponding user intention.
• For example, the NLU module receives a text that is a near-homophone of "Biao ge ge ba" rather than the exact text "Biao ge ge ba", and it can still correctly recognize that the corresponding user intention is "play music".
• The different voice commands that fall within the scope of the NLU module's robustness or error correction capability in this example belong to basically the same voice command. It can be understood that two voice commands whose content or pronunciation is not exactly the same may still belong to basically the same voice command.
• Therefore, the voice interaction method further includes: receiving a fourth voice command, where the content or pronunciation of the fourth voice command is not exactly the same as that of the first voice command, and the similarity between the content or pronunciation of the fourth voice command and that of the first voice command is within a first range; and, in response to the fourth voice command, performing the same first response as for the second voice command.
• The fourth voice command may be a voice command that is determined, based on the robustness of the voice assistant, to be basically the same as the first voice command, or may be a voice command that the voice assistant determines to have a similarity to the first voice command within the first range.
• For example, a fourth voice command that is a near-homophone of "Biao ge ge ba" may be recognized as basically the same voice command as the first voice command "Biao ge ge ba".
• For another example, the voice assistant determines that the similarity between the fourth voice command and the first voice command "Biao ge ge ba" is 95%, and the first range is 90% to 99%.
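• One possible way to check whether a fourth voice command falls within the first range is a normalized similarity ratio over the recognized text, as sketched below; the ratio function and the 90% to 99% band are taken from the example above and are otherwise assumptions.

```python
from difflib import SequenceMatcher

FIRST_RANGE = (0.90, 0.99)   # similarity band from the example above (hypothetical)

def is_basically_same(fourth_command: str, first_command: str) -> bool:
    """Treat the fourth command as basically the same as the first command when the
    similarity of their recognized content falls inside the first range."""
    if fourth_command == first_command:
        return False                     # identical content is the third-command case
    ratio = SequenceMatcher(None, fourth_command, first_command).ratio()
    return FIRST_RANGE[0] <= ratio <= FIRST_RANGE[1]
```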
• In actual use, a third voice command or a fourth voice command may be received.
• Both the third voice command and the fourth voice command may also be received, for example the third voice command is received before the fourth voice command, or the fourth voice command is received before the third voice command.
• After the voice assistant detects the second voice command, it establishes the association relationship between the first voice command and the second voice command, and expands its ability to understand the intention of the first voice command.
• The interaction process is described below, continuing to take the interaction between the third voice command "Biao ge ge ba" and the mobile phone as an example.
• After the mobile phone performs the first operation "play music" in step S306, the mobile phone outputs feedback.
• The user continues to send the third voice command "Biao ge ge ba" to the mobile phone. As shown in FIG. 9(a), the mobile phone displays on the dialogue interface 101 the text content of the third voice command output by the user, the text of the answer to the user's voice, "Okay, start playing music", and a music control 102. The song currently being played by the mobile phone is displayed in the music control 102.
• The sending of the third voice command by the user is not limited to the interaction process in which the user uses the first voice command for the first time; it may also occur after that interaction process ends.
• For example, the user has used the first voice command "Biao ge ge ba" to interact with the mobile phone, and the voice assistant or the mobile phone has expanded the voice assistant's ability to understand the intention of the first voice command based on the above steps.
• In a later interaction, the user inputs the voice "Xiaoyi, Xiaoyi, Biao ge ge ba" to the mobile phone to wake up the voice assistant on the mobile phone; the voice assistant recognizes the semantics of the third voice command "Biao ge ge ba" according to the association relationship between the first voice command and the second voice command, and then performs the first operation according to the user intention of the third voice command.
• The mobile phone displays on the dialogue interface 103 the dialogue history, the text content of the currently input third voice command "Biao ge ge ba", the text content of the voice assistant's answer to the user's voice, "OK, start playing music", and a music control 104. At this moment, the song being played by the mobile phone is displayed in the music control 104.
• In addition, after the first voice command is associated with the second voice command, the next time the user uses the first voice command to interact with the mobile phone, the user is not limited to voice and can also interact with the mobile phone in text form. For example, if the voice assistant has associated "play music" with "Music walk up", then the next time the user operates the mobile phone and sends the text content "Music walk up" to the voice assistant on the dialogue interface, the voice assistant can recognize that the user intention corresponding to the text content "Music walk up" is "play music".
  • the voice interaction method provided in the embodiment of the present application may be implemented on the above dialogue interface, or may be implemented on the setting interface.
  • An electronic device that implements the voice interaction method provided in the embodiment of the present application may provide a setting interface, and the setting interface may be used for a user to set voice commands. The user can perform voice command association settings on this setting interface. If the user inputs a first voice command to the setting interface, and subsequently inputs a second voice command to the setting interface, the voice assistant associates the first voice command with the second voice command.
  • the input of the first voice command or the second voice command on the setting interface may be through voice input or text input, which is not specifically limited in this application.
  • the voice interaction method provided in the embodiment of the present application is not limited to the scenario where the voice assistant fails to recognize the voice instruction.
  • the voice interaction method provided by the embodiment of this application can also be applied to various scenarios according to the user's personal needs, such as:
• Scenario 1: the voice interaction method provided by the embodiment of the present application is applied to setting personalized voice commands for the voice assistant.
• Users can adjust the voice assistant's semantic recognition of voice commands according to their personal language habits or needs. For example, the user is accustomed to saying "Music walk up", but the voice assistant cannot recognize that the user intention corresponding to "Music walk up" is "play music"; the user can then actively use the second voice command "play music" to explain "Music walk up". The above steps are performed on the mobile phone to associate "play music" with "Music walk up". The next time the user interacts with the mobile phone using the command "Music walk up" (in voice or text form), the voice assistant can recognize that the user intention corresponding to "Music walk up" is "play music".
• Scenario 2: the voice interaction method provided by the embodiment of the present application is applied to setting special voice commands for special groups of people.
• Special groups such as foreigners, the elderly or children may not be able to output standard voice commands.
• For example, a child may mispronounce "play music" as "Bo fang le". After the child outputs the first voice command "Bo fang le" to the mobile phone, an adult can send the second voice command "play music" to the mobile phone.
• The voice assistant can then associate the first voice command "Bo fang le" with the second voice command "play music".
• The next time the child says "Bo fang le", the voice assistant obtains, according to the association between the first voice command "Bo fang le" and the second voice command "play music", that the user intention corresponding to the first voice command "Bo fang le" is the user intention corresponding to the second voice command "play music", and the mobile phone plays music.
  • the above-mentioned scene 1 and scene 2 can be realized during the dialogue process, and can also be realized on a specific dialogue interface or setting interface, which is not specifically limited in this embodiment of the present application.
• In the above method, the expansion of the voice assistant's intention understanding ability does not involve adding custom semantics through cumbersome UI operations, nor does it involve designing or calling an extended-semantics interface through an application programming interface (API); it is completed directly through human-machine voice interaction.
  • the voice interaction method provided by the embodiment of the present application is not limited to the voice interaction of a single electronic device. After the first electronic device associates the first voice command with the second voice command, the first electronic device can synchronize the association relationship to other electronic equipment.
  • the association relationship between the standard voice instruction and the non-standard voice instruction may be an equivalent relationship between the standard voice instruction and the non-standard voice instruction.
  • the equivalence relationship between standard voice commands and non-standard voice commands may be shown in Table 1 below.
• Table 1:
Non-standard voice command | Standard voice command
sing a song | play music
Movie walk up | play movie
Open PYQ | open circle of friends
  • the association relationship between the standard voice instruction and the non-standard voice instruction may be the association relationship between the recognition result corresponding to the standard voice instruction (such as the user's intention) and the non-standard voice instruction.
  • the association relationship between the recognition result corresponding to the standard voice command (such as the user's intention) and the non-standard voice command can be shown in Table 2 below.
  • the voice assistant updates the association table according to the association relationship between the standard voice instruction and the non-standard voice instruction, and improves the recognition of the non-standard voice instruction by the voice assistant according to the updated association table.
  • a database may be constructed to store association tables.
• The association table is maintained by the voice assistant and describes the association between non-standard voice commands (including commands with non-standard sentence patterns or keywords, and/or ambiguous commands) and standard voice commands, that is, the equivalent mapping relationship between non-standard voice commands and standard voice commands.
  • the association table may be Table 1 above.
  • the association table is mounted in the ASR module 22 of the voice assistant, and the ASR module 22 updates and uses the association table.
• Specifically, the voice assistant updates the association table between non-standard voice commands and standard voice commands according to the association relationship between the first voice command and the second voice command; that is, the first voice command "Biao ge ge ba" is filled in the non-standard voice command column of the association table, and the second voice command "play music" corresponding to the first voice command "Biao ge ge ba" is filled in the standard voice command column of the association table.
• After the voice assistant finishes updating the association table, the user sends the first voice command "Biao ge ge ba" to the electronic device again.
• The electronic device receives the first voice command "Biao ge ge ba", and the ASR module 22 of the voice assistant recognizes the first voice command "Biao ge ge ba".
• The language model of the ASR module 22 obtains the first voice command text "Biao ge ge ba" and can consult the association table; according to the association table, the first voice command "Biao ge ge ba" is replaced by the associated second voice command "play music", and the second voice command text "play music" is output to the NLU module 23.
• The NLU module 23 then processes the second voice command "play music".
• That is, the subsequent processing flows of the NLU module 23 and the DM module of the voice assistant process the second voice command "play music" instead of the first voice command "Biao ge ge ba".
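• Mounting the association table in the ASR module 22 can be pictured as a post-processing step on the ASR text before it reaches the NLU module 23, as in the sketch below; asr_transcribe and nlu_understand are placeholder callables, not interfaces defined by this application.

```python
# Sketch: association table consulted after the ASR language model. The text the ASR
# module would output is replaced by the associated standard command before the NLU step.
association_table = {"biao ge ge ba": "play music"}

def apply_association(raw_text: str) -> str:
    return association_table.get(raw_text.strip().lower(), raw_text)

def process_utterance(audio, asr_transcribe, nlu_understand):
    text = asr_transcribe(audio)        # e.g. the text "Biao ge ge ba"
    text = apply_association(text)      # replaced by the standard command "play music"
    return nlu_understand(text)         # the NLU module only sees the standard command
```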
• In other embodiments, the above association table can be mounted in the NLU module 23.
• For example, the NLU module 23 replaces the non-standard voice command with the corresponding standard voice command according to the association table.
• Alternatively, the association table is directly shifted from the ASR module 22 to the front end of the NLU module 23, and the non-standard voice command text that is about to be input to the NLU module 23 for processing is replaced with the standard voice command text.
• The specific implementation is the same as that of the ASR module 22 and is not repeated here.
• The above association table may also be associated with the DM module 24.
• For example, the association table is used as a skill, or placed in a specific skill. If a module of the voice assistant cannot recognize a certain voice command, the DM module calls the skill to check whether the voice command is associated with another voice command, or whether the voice command has been associated with a specific operation; if so, the DM module performs the operation according to the response corresponding to the associated voice command, or directly executes the matched associated specific operation.
• The voice command replacement processing may also be performed in other speech recognition algorithms or flows, which is not specifically limited in this embodiment of the present application.
  • the voice assistant can update the association table in the following ways:
• In some embodiments, the skill list maintained by the voice assistant includes an association skill, and the association skill is used to associate the currently processed voice command with the previous voice command.
• For example, the user uses the second way to send the second voice command, that is, the user inputs the voice "I mean play music" to the electronic device, and the ASR module 22 converts the voice input into the text "I mean play music" and inputs it to the NLU module 23.
• The NLU module 23 extracts user intentions from the text corresponding to the voice: it extracts user intention 1 (i.e., "play music") from the text "play music", extracts user intention 2 (i.e., that the previous voice command is being explained), and delivers the extracted intention data to the DM module 24.
• The DM module 24 invokes the corresponding skills according to user intention 1 and user intention 2 respectively; that is, the music playback control skill is invoked according to user intention 1, and the association skill is invoked according to user intention 2.
• The music playback control skill and the association skill perform the corresponding operations by invoking the corresponding service interfaces. For example, the music playback control skill invokes the playback control service, the voice assistant outputs an execution instruction to the electronic device, and the electronic device plays music according to the execution instruction.
• The association skill invokes the association service: the voice assistant records the first voice command "Biao ge ge ba" and the second voice command "play music", fills the first voice command "Biao ge ge ba" in the non-standard voice command column of the association table, and fills the second voice command "play music" corresponding to the first voice command "Biao ge ge ba" in the standard voice command column of the association table, thereby realizing the association between the standard voice command (the second voice command) and the non-standard voice command (the first voice command).
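• The flow in which the NLU module extracts two intentions from "I mean play music" and the DM module invokes a playback skill plus an association skill might be sketched as follows; the skill and intent names are assumptions used only for illustration.

```python
association_table = {}
last_unrecognized_command = "biao ge ge ba"      # the first voice command

def music_playback_skill(command_text):
    print(f"playing music for: {command_text}")  # stands in for the playback control service

def association_skill(standard_command):
    # Fill the association table: non-standard first command -> standard second command.
    association_table[last_unrecognized_command] = standard_command

def dialog_manager(intents, second_command):
    for intent in intents:
        if intent == "PlayMusic":                    # user intention 1
            music_playback_skill(second_command)
        elif intent == "ExplainPreviousCommand":     # user intention 2
            association_skill(second_command)

dialog_manager(["PlayMusic", "ExplainPreviousCommand"], "play music")
print(association_table)    # {'biao ge ge ba': 'play music'}
```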
• When the association table is the above Table 2, the association table can be mounted on the NLU module 23. When the NLU module 23 works, it finds from the association table that the user intention of the first voice command "Biao ge ge ba" is "play music", and the NLU module 23 outputs that the user intention of the first voice command "Biao ge ge ba" is "play music".
  • the ASR module 22 includes a detection module.
  • The detection module detects whether the speech input by the user contains a preset template; if it does, the module extracts from the speech input the standard voice command that explains or restates the unrecognized voice command and passes it to the NLU module 23. The detection module is also used to associate the extracted standard voice command with the non-standard voice command.
  • Specifically, in step S305 of FIG. 3 above, the user sends the second voice command in the second manner, that is, the user inputs the speech "I mean play music" to the electronic device, and the ASR module 22 correctly converts the speech input into the text "I mean play music".
  • the language model feeds the text "I mean play music” to the detection module.
  • The detection module checks whether the current speech input contains the preset template "I mean" by detecting or matching that template. If the preset template is present, the detection module determines that the current speech input restates or explains the first voice command "Let's blast a song", and it extracts from the text "I mean play music" the second voice command "Play music" used to restate or explain the first voice command, namely the text that follows the preset template "I mean" in the speech input.
  • The detection module passes the extracted second voice command to the subsequent NLU module 23 and DM module 24 to identify the user intent of the second voice command "Play music" and invoke the music playback control skill; the music playback control skill calls the playback control service, the voice assistant outputs an execution command to the electronic device, and the electronic device plays music according to that command.
  • The detection module updates the association table: it fills the first voice command "Let's blast a song" into the non-standard-voice-command column and fills the second voice command "Play music" into the standard-voice-command column in the row of the first voice command, thereby establishing the association between the standard voice command (the second voice command) and the non-standard voice command (the first voice command).
  • It should be noted that the detection module may be part of the ASR module 22, may be deployed outside the ASR module 22, or may be placed after the ASR module 22; the deployment location and form of the detection module are not limited. A minimal sketch of the template-matching step is given below.
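The template-matching step performed by such a detection module can be sketched as below, assuming the preset templates are fixed strings; the function name and template list are illustrative assumptions.

```python
# Preset templates indicating that the current input explains the previous command.
PRESET_TEMPLATES = ["我的意思是", "我上一句的意思是", "我刚才说的是"]

def detect_explanation(asr_text: str):
    """Return the second (explaining) voice command if a preset template is present,
    otherwise None."""
    for template in PRESET_TEMPLATES:
        idx = asr_text.find(template)
        if idx != -1:
            # The explaining command is the text that follows the template.
            return asr_text[idx + len(template):].strip(" ，,。")
    return None

second_command = detect_explanation("我的意思是播放音乐")
if second_command is not None:
    # "播放音乐" is passed on to the NLU/DM modules, and the association table
    # is updated with the previously unrecognized command.
    print(second_command)   # -> 播放音乐
```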
  • Optionally, when the user sends the second voice command in the first manner, the voice assistant may also by default associate the first voice command "Let's blast a song" with the second voice command "Play music": it fills the first voice command into the non-standard-voice-command column of the association table and fills the second voice command "Play music" into the standard-voice-command column in the row of the first voice command.
  • Optionally, when the user sends the second voice command in the third manner, the electronic device receives the trigger instruction and forwards it to the voice assistant; according to the trigger instruction, the voice assistant associates the first voice command "Let's blast a song" with the second voice command "Play music", filling the first voice command into the non-standard-voice-command column of the association table and filling the second voice command "Play music" into the standard-voice-command column in the row of the first voice command.
  • In mode 2, the voice assistant constructs training data from the association between standard voice commands and non-standard voice commands and trains the voice assistant with that training data.
  • A database can be constructed to store the association table, which may be the above Table 2. Training data compiled from this table is used to train the ASR module 22 and/or the NLU module 23 of the voice assistant, for example by incremental learning, so that the trained ASR module 22 and NLU module 23 can recognize the non-standard voice commands. That is, the non-standard voice commands are added to the training samples, and the network models corresponding to the ASR module 22 and/or the NLU module 23 are retrained.
  • It can be understood that mode 2 does not require accumulating a certain amount of data before training. Based on the association between standard and non-standard voice commands, it only needs to take the user intent of the standard voice command associated with each non-standard voice command, construct training data consisting of the non-standard voice commands and their corresponding user intents, and add it to the training set; this completes the incremental learning described above.
  • Referring to FIG. 13, for example, a new training sample "command: Let's blast a song; intent: play music; slot: null" is added, and the NLU module 23 is retrained or incrementally trained with the newly added training data. After training, the NLU module 23 can recognize the non-standard voice command "Let's blast a song". This expands the voice assistant's intent-understanding capability, as sketched below.
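The construction of the incremental training data in mode 2 can be sketched as follows, assuming each NLU training sample is a simple text/intent/slots record; the sample format and variable names are assumptions for illustration.

```python
association_table = {          # non-standard command -> intent of the associated standard command
    "飚个歌吧": "播放音乐",
    "Movie走起": "播放电影",
    "打开PYQ": "打开朋友圈",
}

def build_incremental_samples(table):
    # One new labeled sample per associated non-standard command,
    # e.g. {"text": "飚个歌吧", "intent": "播放音乐", "slots": None}.
    return [{"text": cmd, "intent": intent, "slots": None}
            for cmd, intent in table.items()]

existing_training_set = [
    {"text": "播放音乐", "intent": "播放音乐", "slots": None},
    # ... the original standard-command samples used to train the NLU module
]
existing_training_set.extend(build_incremental_samples(association_table))
# existing_training_set is then used to retrain or incrementally train the NLU model,
# after which "飚个歌吧" is recognized as the intent "播放音乐".
print(len(existing_training_set))
```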
  • In one possible implementation, on-device learning techniques such as few-shot learning and incremental learning can further be used to mine the user's language habits and incrementally train one or more of the voice assistant's ASR module 22 and NLU module 23, further improving the assistant's intent understanding and the naturalness of the interaction.
  • If the number of recorded user-defined non-standard voice commands is too small to support incremental learning of the NLU module 23, a GAN can be used to generate, in batches, user-defined non-standard voice commands in the same style as the recorded ones.
  • Referring to FIG. 14, a small amount of labeled data (the recorded user-defined non-standard voice commands and their equivalent standard voice commands) is first used to fine-tune the generator network and the classifier network so that they capture the user's language habits. Standard voice commands are then fed into the generator to produce corresponding user-defined non-standard voice commands in batches, yielding labeled pairs of standard and non-standard voice commands that cover various scenarios and match the user's speech habits. Finally, the generated labeled data is used for incremental learning of the NLU module.
  • Optionally, the generator network can be built on pre-trained models such as BERT or GPT-3; this is not specifically limited in the embodiments of the present application. A schematic sketch of the batch-generation step follows.
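The batch-generation step can be sketched schematically as below. A trivial lookup-based stand-in is used in place of the fine-tuned GAN generator so that the example runs; it is not the actual generation network, and its output is limited to variants already mentioned in this description.

```python
def toy_generator(standard_command):
    # Stand-in for the fine-tuned generator network; a real generator would produce
    # fluent new variants in the user's learned style rather than a fixed lookup.
    toy_variants = {
        "播放音乐": ["Music走起", "飚个歌吧"],
        "播放电影": ["Movie走起"],
        "打开朋友圈": ["打开PYQ"],
    }
    return toy_variants.get(standard_command, [])

standard_commands = ["播放音乐", "播放电影", "打开朋友圈"]

labeled_pairs = []
for std in standard_commands:
    for non_std in toy_generator(std):
        # Each generated pair labels a non-standard command with the intent of
        # its standard equivalent, ready for incremental learning of the NLU module.
        labeled_pairs.append({"text": non_std, "intent": std})

print(labeled_pairs[:2])
# [{'text': 'Music走起', 'intent': '播放音乐'}, {'text': '飚个歌吧', 'intent': '播放音乐'}]
```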
  • It can be understood that the voice assistant can also be trained with few-shot learning techniques; the specific implementation principle is similar to the above and is not repeated here.
  • With on-device learning techniques such as few-shot learning and incremental learning, the voice assistant can further support non-standard sentence patterns/keywords, and/or ambiguous commands, that the user has never "explained" or that have no recorded history. That is, it not only supports the non-standard sentence patterns/keywords and/or ambiguous commands the user has "explained", but, by generating training data sets or through few-shot learning, also supports those the user has never used or "explained", so that the voice assistant moves from mechanically extending support for user-defined non-standard voice commands to mining and learning the language habits hidden in the user's voice commands.
  • For example, the user has never used the non-standard voice command "Open PYQ" to interact with the phone. After the NLU module 23 is updated with incremental learning and Generative Adversarial Network (GAN) training, the voice assistant mines the user's language habits from the non-standard voice command "Let's blast a song" and can recognize the user intent of the non-standard voice command "Open PYQ" as "Open Moments".
  • In addition to training and updating the functional modules of the local voice assistant, technologies such as federated learning and data crowdsourcing can be used to mine and learn the non-standard voice command information of group users, so that the voice assistant can adapt more quickly to trending words, popular events, and the like.
  • In one possible implementation, the user actively or passively uploads the user-defined non-standard voice commands and their equivalent standard voice command information, and/or their feature information, and/or the associated algorithm-model training update information, to the cloud server.
  • After obtaining one or more of the above kinds of information, the cloud server can extract the common information and use it to train and update the voice assistant's public ASR module 22 and NLU module 23, that is, the default ASR and NLU algorithms carried by all users.
  • The cloud server delivers the trained and updated ASR and/or NLU algorithms to the client side; the updated voice assistant APP can be delivered to group users through a version update. The aggregation of group-level command pairs is sketched below.
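The crowdsourcing path can be sketched as a simple aggregation of uploaded command pairs on the cloud side, as below; the threshold, function name, and data format are assumptions. This illustrates the crowdsourcing variant only; a federated-learning variant would exchange model updates rather than raw command pairs.

```python
from collections import Counter

def aggregate_uploads(uploads, min_users: int = 100):
    """uploads: list of (non_standard, standard) pairs, one entry per reporting user.
    Keep only the pairs shared by at least min_users users as 'common information'."""
    counts = Counter(uploads)
    return [pair for pair, n in counts.items() if n >= min_users]

# 150 users report "打开PYQ" -> "打开朋友圈"; only 30 report "飚个歌吧" -> "播放音乐".
uploads = [("打开PYQ", "打开朋友圈")] * 150 + [("飚个歌吧", "播放音乐")] * 30
common_pairs = aggregate_uploads(uploads)
print(common_pairs)   # [('打开PYQ', '打开朋友圈')] -> used to update the public ASR/NLU models
```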
  • An embodiment of the present application provides a computer-readable storage medium, and the computer-readable storage medium contains computer-executable instructions for performing any one of the above methods.
  • An embodiment of the present application provides a system, and the system includes: the computer-readable storage medium provided in the second aspect; and a processor capable of executing computer-executable instructions.
  • An embodiment of the present application provides an electronic device, including: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, where, when the program is executed by the processor, the electronic device performs any one of the above methods.
  • In the above embodiments, the implementation may be entirely or partly by software, hardware, firmware, or any combination thereof.
  • When software is used, the implementation may take, in whole or in part, the form of a computer program product.
  • The computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions according to FIG. 3 of the embodiments of the present application are produced.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.
  • The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
  • The program can be stored in a computer-readable storage medium; the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

一种语音交互方法及电子设备,涉及人工智能(artificial interlligence,AI)领域,应用于语音助手,该方法可以应用于语音助手也可以应用于电子设备,该方法包括接收第一语音指令(S302),第一语音指令无法被有效识别(S304),接收第二语音指令(S305),并建立第二语音指令与第一语音指令的关联关系(S307),第二语音指令对应第一响应,接收第三语音指令(S308),其中,第三语音指令与第一语音指令的内容或发音相同,响应于第三语音指令,执行与第二语音指令相同的第一响应(S309),提升交互体验和效率,提供更懂用户的个性化的语音助手。

Description

语音交互方法及电子设备
本申请要求于2021年9月18日提交中国专利局、申请号为202111101013.4,发明名称为“语音交互方法及电子设备”的中国专利的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能(artificial interlligence,AI)领域,尤其涉及一种语音交互的方法及电子设备。
背景技术
随着语音识别技术的发展,许多电子设备已具备语音交互功能,支持用户通过语音指令操控电子设备。上述电子设备中安装了语音助手(例如Siri、小爱同学、小E等),用户可以通过触发语音助手实现打开目标应用、播放音乐和查询天气等操作。
语音助手提供的语音交互功能,需要语音助手对用户语音指令进行准确识别才能执行用户想要的操作。语音助手能识别的语音指令通常是其内部的语音识别算法或模型经过训练后所能支持的语音指令,但有时用户发出的语音指令可能不同于语音助手能识别的语音指令,容易导致语音交互失败,用户体验差。
发明内容
本申请实施例公开了一种语音交互方法及电子设备,可以提高对语音指令识别的能力,提升用户使用体验。
第一方面,本申请实施例提供一种语音交互方法,该方法可以应用于语音助手也可以应用于电子设备,该方法包括接收第一语音指令,第一语音指令无法被有效识别,接收第二语音指令,并建立第二语音指令与第一语音指令的关联关系,第二语音指令对应第一响应,接收第三语音指令,其中,第三语音指令与第一语音指令的内容或发音相同,响应于第三语音指令,执行与第二语音指令相同的第一响应。
上述第一语音指令无法被有效识别包括无法识别第一语音指令的语义(如意图)。示例地,第一语音指令为“飚个歌吧”,电子设备上的语音助手无法识别到该第一语音指令的意图为“播放音乐”,则电子设备无法执行相应的操作。上述第一语音指令无法被有效识别还包括错误识别第一语音指令的语义(如意图)。示例性地,第一语音指令为“飚个歌吧”,其对应的意图为“播放音乐”,但电子设备上的语音助手错误识别该第一语音指令的意图为“开灯”,电子设备执行开灯操作。
上述第一响应即第一操作。示例性地,第一语音指令“飚个歌吧”,对应的意图为“播放音乐”,则其对应的第一响应为“播放音乐”。电子设备执行第一响应也即电子设备执行第一操作,具体可以包括,电子设备的语音助手对用户语音指令进行语音识别、语义理解等过程,确定出用户意图为播放音乐,并得到播放音乐的执行指令,电子设备响应该执行指令播音音乐。
基于上述方案,在语音助手或电子设备无法识别或无法正确识别用户输出的第一语音指 令对应的用户意图时,语音助手或电子设备继续使用拾音设备(如麦克风)采集用户的语音,以接收用户输入的用于解释或复述第一语音指令的第二语音指令,该第二语音指令能被有效识别,即能正确识别出该第二语音指令的用户意图。在正确识别出第二语音指令后,执行第二语音指令对应的第一响应。建立第二语音指令与第一语音指令的关联关系,根据该关联关系可以识别与第一语音指令的内容或发音相同的第三语音指令,执行与第二语音指令相同的第一响应。提高语音助手对语音指令识别的能力,提升用户使用体验,提供更懂用户的个性化的语音助手。
在第一方面的一种可能的实现方式中,接收第二语音指令包括:在无法生成第一语音指令的识别结果时,建立与第一语音指令相关联的学习会话;在学习会话期间,接收第二语音指令。相对于现有技术中的语音助手无法识别用户输出的第一语音指令时,就会结束对第一语音指令的处理,终止交互流程。现有技术中的语音助手对第一语音指令对应用户意图识别错误时,用户无法修正语音助手对第一语音指令的真实用户意图识别,本申请实施例提供的语音交互方法通过建立与第一语音指令相关联的学习会话,为用户提供一个解释流程,用户可以继续使用语音指令与电子设备进行语音交互,以通过另一语言表达内容向电子设备或语音助手再次表达上一条第一语音指令的内容,即用户换一种说法向电子设备或语音助手表达或解释上一第一语音指令的语义,以便语音助手能理解第一语音指令对应的用户意图。
在第一方面的一种可能的实现方式中,建立第一语音指令与第二语音指令的关联关系包括:检测到在学习会话期间接收到第二语音指令,建立第一语音指令与第二语音指令的关联关系。在建立学习会话期间,可以默认接收到的第二语音指令是用于解释第一语音指令的,则可以直接建立第一语音指令与第二语音指令的关联关系,提高语音交互效率。
在第一方面的一种可能的实现方式中,建立第一语音指令与第二语音指令的关联关系包括:检测触发指令;在检测到触发指令时,将在学习会话期间接收到的第二语音指令与第一语音指令建立关联关系。为了避免错误将与第一语音指令无关的语音指令作为对第一语音指令的解释,进而建立第一语音指令与第二语音指令的关系,则可以在接收到触发指令后才将第二语音指令与第一语音指令建立关联关系,提高关联的准确度。
在第一方面的一种可能的实现方式中,在学习会话期间,接收第二语音指令包括:在学习会话期间,接收语音输入,其中语音输入包括第二语音指令和用于解释第一语音指令的第一语音内容;则建立第一语音指令与第二语音指令的关联关系包括:在检测到第一语音内容时,建立第二语音指令与第一语音指令的关联关系。
上述第一语音内容可以为预设模板,如“我的意思是”或“我想表达的是”等。则语音助手在识别到“我的意思是”或“我想表达的是”,则认为该语音输入中包括用于解释第一语音指令的第二语音指令。通过可以将预设模板后的语音指令作为第二语音指令。
上述第一语音内容还可以为更灵活的语言表达,如“不是不是,应该是”或,“不是这个意思,是”等。即不需要根据模板匹配,语音助手可以通过识别第一语音内容的用户意图来确定该语音输入是否存在用于解释第一语音指令的第二语音指令。
基于上述方案,用户在语音交互的过程中即可以解释未被电子设备或语音助手有效识别的第一语音指令,提高了用户使用体验,更加智能化与人性化。
在第一方面的一种可能的实现方式中,在接收第二语音指令之前,还包括:
输出反馈以引导用户继续输入语音指令。通过输出反馈引导用户继续交互,便于用户了解与使用本申请实施例提供的语音交互方法。
在第一方面的一种可能的实现方式中,在接收第二语音指令之前还包括:响应于第一语音指令,执行第二响应,其中第二响应不同于第一响应。上述第二响应即语音助手错误识别用户意图后所执行的操作,通过电子设备执行第二响应,用户可以根据电子设备执行的第二响应了解到用户识别错误。该第二响应还可以包括执行完操作之后的输出反馈。如第一语音指令为“飚个歌吧”,其对应的意图为“播放音乐”,但电子设备上的语音助手错误识别该第一语音指令的意图为“开灯”,则第二响应包括电子设备把灯打开,向用户反馈“好的,已开灯”。
在第一方面的一种可能的实现方式中,建立第一语音指令与第二语音指令的关联关系包括:检测触发指令;在检测到触发指令时,建立第二语音指令与第一语音指令的关联关系。在语音助手识别错误后,用户主动去触发,告知语音助手识别错误,以建立第二语音指令与第一语音指令的关联关系,引导语音助手正确识别。
在第一方面的一种可能的实现方式中,接收第二语音指令包括:接收用户的语音输入,其中语音输入包括第二语音指令和用于指示第一语音指令的识别结果存在错误的第二语音内容。在第一方面的一种可能的实现方式中,建立第一语音指令与第二语音指令的关联关系包括:在检测到第二语音内容时,建立第二语音指令与第一语音指令的关联关系。
上述第二语音内容可以为预设模板,如“我的意思是”或“我想表达的是”等。则语音助手在识别到“我的意思是”或“我想表达的是”,则认为该语音输入中包括用于修正第一语音指令的第二语音指令。通过可以将预设模板后的语音指令作为第二语音指令。
上述第二语音内容还可以为更灵活的语言表达,如“不是不是,应该是”或,“不是这个意思,是”等。即不需要根据模板匹配,语音助手可以通过识别第一语音内容的用户意图来确定该语音输入是否存在用于修正第一语音指令的第二语音指令。
基于上述方案,用户在语音交互的过程中即可以修正被电子设备或语音助手有效识别的第一语音指令,提高了用户使用体验,更加智能化与人性化。
在第一方面的一种可能的实现方式中,建立第一语音指令与第二语音指令的关联关系包括:将第一语音指令等同于第二语音指令,或,将第二语音指令的第一响应与第一语音指令相关联。即直接将第一语音指令的内容如“飚个歌吧”等同于第二语音指令“播放音乐”,或,将第二语音指令的第一响应“播放音乐”与第一语音指令“飚个歌吧”关联。在下回接收到第三语音指令“飚个歌吧”,则可以直接将第三语音指令“飚个歌吧”等同于“播放音乐”,即对第三语音指令进行识别处理时,是以“播放音乐”进行处理。或,在下回接收到第三语音指令“飚个歌吧”,直接得到第三语音指令“飚个歌吧”的第一响应为“播放音乐”,电子设备直接执行“播放音乐”。
在第一方面的一种可能的实现方式中,语音交互方法还包括:根据关联关系生成训练数据集;将训练数据集用于训练语音助手的模型,以使得语音助手能处理适配用户语言习惯的语音指令。通过关联关系可以生成更多关于用户语言习惯表达的训练数据集,使用这些训练数据集训练语音助手,以使得语音助手能处理适配用户语言习惯的语音指令。
在第一方面的一种可能的实现方式中,根据关联关系生成训练数据集包括:将关联关系上传至云服务器;接收群体用户上传的关联关系,以生成适配群体用户语言习惯的训练数据集。通过关联关系可以生成更多符合群体用户语言习惯表达的训练数据集,使用这些训练数据集训练语音助手,以使得语音助手能处理适配群体用户语言习惯的语音指令。
在第一方面的一种可能的实现方式中,语音交互方法还包括:接收第四语音指令,其中 第四语音指令与第一语音指令的内容或发音不完全相同,第四语音指令与第一语音指令的内容或发音的相似度在第一范围内,响应于第四语音指令,执行与第二语音指令相同的第一响应。
其中,第一范围可以为基于语音助手鲁棒性判定第四语音指令与第一语音指令为实质相同的语音指令的范围,也可以为语音助手判定第四语音指令与第一语音指令相似的相似度范围。如基于语义助手鲁棒性,可以将第四语音指令“飚首歌吧”识别为与第一语音指令“飚个歌吧”是实质相同的语音指令。
在其中一种可能实现方式中,在接收第二语音指令之后,可以接收到第三语音指令或第四语音指令。在其中一种可能实现方式中,在接收到第二语音指令后,可以接收到第三语音指令和第四语音指令,如先接收到第三语音指令后再接收第四语音指令,或先接收第四语音指令再接收第三语音指令。
第二方面,本申请实施例提供一种计算机可读存储介质,计算机可读存储介质包含用于执行上述任一项的方法的计算机可执行指令。
第三方面,本申请实施例提供一种系统,系统包括:第二方面提供的计算机可读存储介质;和能够执行计算机可执行指令的处理器。
第四方面,本申请实施例提供一种电子设备,包括:至少一个存储器,用于存储程序;和至少一个处理器,用于执行存储器存储的程序,当程序被处理器执行时,以使得电子设备执行如上任一的方法。
上述其他方面对应的有益效果,可以参见关于方法方面的有益效果的描述,此处不予赘述。
附图说明
图1为本申请实施例提供的一种电子设备的结构示意图。
图2为本申请实施例提供的一种语音助手的结构示意图。
图3为本申请实施例提供的一种语音交互方法的流程示意图。
图4为本申请实施例提供的一种语音助手唤醒场景示意图。
图5(a)-5(b)为本申请实施例提供的一种第一语音指令交互场景示意图。
图6(a)-6(d)为本申请实施例提供的一种对第一语音指令反馈的场景示意图。
图7(a)-7(c)为本申请实施例提供的一种第二语音指令交互场景示意图。
图8为本申请实施例提供的一种对第二语音指令反馈的场景示意图。
图9(a)-9(b)为本申请实施例提供的一种第三语音指令交互场景示意图。
图10为本申请实施例提供的一种关联表应用场景示意图。
图11为本申请实施例提供的另一种关联表应用场景示意图。
图12为本申请实施例提供的另一种关联表应用场景示意图。
图13为本申请实施例提供的另一种关联表应用场景示意图。
图14为本申请实施例提供的另一种关联表应用场景示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。其中,在本申请实施例的描述中,除非另有说明,“/”表示或的意思,例如,A/B可以表示A或B;本 文中的“和/或”仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,在本申请实施例的描述中,“多个”是指两个或多于两个。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
在本申请实施例中,“示例性地”、“例如”或“在一些示例中”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性地”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性地”、“例如”或“在一些示例中”等词旨在以具体方式呈现相关概念。
如上述,用户发出的语音指令为语音助手不能识别的语音指令将导致差的用户体验。本申请实施例提供了一种语音交互方法,语音助手无法识别用户输出的语音指令时,用户可以使用语音助手能识别的语音指令对该无法识别语音指令进行解释,语音助手可根据该解释改善对该无法识别语音指令的识别,实现语音助手自动适配拓展语音指令的识别能力。
本申请实施例中的电子设备可以为便携式计算机(如手机)、笔记本电脑、个人计算机(personal computer,PC)、可穿戴电子设备(如智能手表)、平板电脑、智能家居设备、增强现实(augmented reality,AR)\虚拟现实(virtual reality,VR)设备、人工智能(artificial intelligence,AI)终端(例如智能机器人)、车载电脑等,以下实施例对电子设备的具体形式不做特殊限制。
示例性的,图1示出了电子设备的结构示意图。
电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
在本申请实施例中,DSP可以实时监测语音数据,当DSP监测到的语音数据与电子设备中注册的唤醒词的相似度满足预设条件时,便可以将该语音数据交给AP。由AP对上述语音数据进行文本校验和声纹校验。当AP确定该语音数据与用户注册的唤醒词匹配时,电子设备便可以开启语音助手。
控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的 控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过电子设备100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为电子设备供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141可接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,显示屏194,摄像头193,和无线通信模块160等供电。
电源管理模块141可用于监测电池容量,电池循环次数,电池充电电压,电池放电电压,电池健康状态(例如漏电,阻抗)等性能参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括一个或多个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。其中,调制器用于将待发送的低频基带信号调制成中高频信号。解调器用于将接收的电磁波信号解调为低频基带信号。随后解调器将解调得到的低频基带信号传送至基带处理器处理。低频基带信号经基带处理器处理后,被传递给应用处理器。应用处理器通过音频设备(不限于扬声器170A,受话器170B等)输出声音信号,或通过显示屏194显示图像或视频。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(Bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation,FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。 无线通信模块160可以是集成一个或多个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。在一些实施例中,手机100可以包括1个或N个摄像头,N为大于1的正整数。摄像头193可以是前置摄像头也可以是后置摄像头。
NPU为神经网络(neural-network,NN)计算处理器,通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储一个或多个计算机程序,该一个或多个计算机程序包括指令。处理器110可以通过运行存储在内部存储器121的上述指令,从而使得电子设备100执行本申请一些实施例中所提供的语音交互的方法,以及各种功能应用和数据处理等。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统;该存储程序区还可以存储一个或多个应用程序(比如语音识别、图库、联系人等)等。存储数据区可存储电子设备使用过程中所创建的数据等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如一个或多个磁盘存储器件,闪存器件,通用闪存存储器 (universal flash storage,UFS)等。在另一些实施例中,处理器110通过运行存储在内部存储器121的指令,和/或存储在设置于处理器中的存储器的指令,来使得电子设备100执行本申请实施例中所提供的语音交互的方法,以及各种功能应用和数据处理。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置一个或多个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。
传感器模块180可以包括压力传感器,陀螺仪传感器,气压传感器,磁传感器,加速度传感器,距离传感器,接近光传感器,指纹传感器,温度传感器,触摸传感器,环境光传感器,骨传导传感器等,本申请实施例对此不做任何限制。
当然,本申请实施例提供的电子设备100还可以包括按键190、马达191、指示器192以及SIM卡接口195等一项或多项器件,本申请实施例对此不做任何限制。
本申请实施例中涉及的“语音助手”,又可以称之为“数字助理”、“虚拟助理”、“智能自动化助理”或“自动数字助理”等。“语音助手”可以理解为一种信息处理系统,其可以识别语音形式和/或文本形式的自然语言输入来推断用户意图并且基于推断出的用户意图来执行相应的动作。该系统可以通过可听(例如,语音)和/或可视形式来输出对用户的输入的响应。
示例地,用户可向语音助手提问,诸如“我现在在哪里?”基于用户的当前位置,语音助手可回答“您在中央公园西门附近。”用户还可请求执行任务,例如“打电话给麦克。”作为响应,语音助手可通过讲出“好的,马上”来确认请求,且语音助手执行拨打联系人“麦克”的电话的任务。除了提供语音响应并执行预置动作之外,语音助手还可提供其他视觉或音频形式(例如,作为文本、提示、音乐、视频、动画等)的响应。可以理解的,用户与语音助手还可以进行其他类型的交互,如聊天、游戏、知识问答等,且交互形式不限,本申请实施例对此不做限定。
请参阅图2,图2为本申请实施例提供的语音助手的功能架构图。下面对于语音助手中各个功能模块进行说明,如图2所示,语音助手包括前端处理模块21、ASR模块22、NLU模块23、DM模块24、NLG模块25和TTS模块26。
前端处理模块21用于将用户输入的语音指令,处理为后级算法所需的数据格式,如音频 特征向量,供ASR模块22使用。
示例性地,前端处理模块21获得用户输入的语音指令后,对该语音指令进行音频解码,解码成pcm格式的音频信号,然后利用声纹或其他特征对该音频信号进行分离、降噪、特征提取,并通过分帧、开窗、短时傅里叶变换等音频处理算法,得到梅尔频率倒谱分析(mel-frequency cepstral coefficients,MFCC)滤波器组(filter bank)的音频特征向量。前端处理模块21一般设置于终端侧。可以理解的是,语音助手也可以不包括独立的前端处理模块21,如前端处理模块21的功能可集成在语音识别模块22中。
语音识别(automatic speech recognition,ASR)模块22用于获取前端处理模块21处理得到的音频特征向量,并将音频特征向量转换为文本,供自然语言理解模块23进行理解。
ASR模块22,用于识别并输出文本识别结果。如ASR模块22使用一个或多个语音识别模型来处理前端处理模块21所提取的音频特征向量以产生中间识别结果(例如,音素、音素串和子字词),并且最终产生文本识别结果(例如,字词、字词串、或符号序列)。
示例的,该一个或多个语音识别模型(例如,声学模型和/或语言模型),可以包括隐马尔可夫模型、高斯混合模型、深层神经网络模型、n元语法语言模型或其他统计模型。声学模型用于把声学特征分类对应到(解码)音素或字词,语言模型用于把音素或字词解码成一个完整的文本。
示例性地,声学模型和语言模型通过串联的方式对音频特征向量进行处理,通过声学模型将音频特征向量转换为中间识别结果(例如,音素、音素串和子字词),再通过语言模型将音素或字词转换为产生文本识别结果(例如,字词、字词串、或符号序列),输出用户语音指令对应的文本或符号序列。
自然语言理解(Natural language understanding,NLU)模块23用于将用户语音指令对应的文本或符号序列进行语义识别,得到语义信息。即将用户语音对应的文本或符号序列转换为结构化信息,其中结构化信息包括技能、机器可执行的意图信息和可识别的槽位信息。NLU模块23其目的是通过语法、语义和语用的分析,获取用户所输入的自然语言的语义表示。
具体地,NLU模块23可以将用户语音对应的文本或符号序列进行技能分类、意图分类以及槽位提取。一般情况下,语音助手可以集成有多个具体的技能,语音助手可以维护一个技能清单,技能清单包括如图2中技能A、技能B、技能N等,每个技能对应着一种类型的服务或者功能,例如:订餐服务、打车服务、查询天气等。每个技能下可以配置有一个或多个意图。例如“天气查询”技能下可以配置有:问答意图“查天气”。每个意图下可以配置有一个或多个槽位。例如问答意图“查天气”可以配置有时间槽位和城市槽位。
对技能、意图和槽位进行说明。
(1)技能
技能可以是一项服务或功能,例如天气查询服务、机票预定服务等等。技能可以由第三方应用(如“天气”)或第三方平台的开发者来配置。一个技能下可以配置有一个或多个意图。
(2)意图
一个意图可以是一个技能下更为细化的服务或功能。意图可以分为对话意图和问答意图。需要参数化的应该使用对话意图,比如订购火车票意图,里面需要车次,出发时间等参数,则应该使用对话意图。问答意图更偏好于解决常见问题解答(Frequently Asked Questions,FAQ)类型的问题。比如退票费怎么收?一个意图下可以配置有一个或多个槽位。
(3)槽位
槽位为用户语句中用来表达用户意图的关键信息,例如,用户意图为对话意图“查天气”,那么NLU模块23需要从用户语音指令中提取的槽位为城市槽位和时间槽位。城市槽位用来表明查询“哪里”的天气,时间槽位用来表明查询“哪天”的天气。
对话管理(Dialog Management,DM)模块24用于根据NLU模块22输出的语义信息以及对话状态,输出下一步的动作,如包括判断应接入服务/平台、采取的反馈操作或回复的应答信息。
其中,DM模块24可用于维护和更新对话状态,并可用于根据对话状态和语义信息等,以决定下一步的动作。DM模块24可以由多个子模块组成。
具体地,DM模块24根据NLU模块23输出的语义,获得对应语音指令的任务,然后对接业务平台27以完成任务;或者,DM模块24根据语音指令对应的任务需要的信息,要求用户进一步输入更多的信息;或者,DM模块24获取语音指令所请求的信息返回给用户。
其中,DM模块24输出的不同技能可以对接不同的业务平台27,例如语义信息为听歌,则可以对接音乐播放平台,语义信息为看视频,则可以对接视频播放平台。
自然语言生成模块(natural language understanding,NLG)模块25用于将DM模块24输出的系统动作进行文本化,得到自然语言文本,并提供给TTS模块26。
语音合成(Text-to-Speech,TTS)模块26用于将NLG模块25生成的自然语言文本进一步转换为可播放的应答语音输出信号。
在本申请实施例中,电子设备可以根据DM模块24输出的指令,执行对应的操作。如果DM模块24输出的指令为用于指示输出语音的指令。此时,NLG模块25可以根据DM模块24输出的指令,生成语音信息,TTS模块26输出该语音。例如,用户输入的语音信息为“播放一首歌曲”,DM模块24输出用于指示输出语音的指令,NLG模块25根据用于指示输出语音的指令,生成输出语音文本“你要播放什么歌曲?”,TTS模块26根据NLG模块25输出语音文本输出语音“你要播放什么歌曲?”,并由电子设备播放该语音。
如果DM模块24输出的指令是其他类型的指令,电子设备则响应于该指令,执行对应的操作。示例性地DM模块24的输出可以具体表现为执行指令,该执行指令用于指示下一步的动作。例如,用户的输入语音信息为“播放歌曲A”,DM模块24输出播放歌曲A的执行指令,电子设备响应于该执行指令,自动播放歌曲A。
下面以通过语音助手控制设备开灯为例,描述语音助手的处理流程。示例的,语音助手可以为一个应用,或一个服务,或集成在其他应用或服务中的功能模块(例如API接口)等,本发明实施例对此不做限定。
搭载语音助手的电子设备(如智能音箱),接收用户输入的语音指令(如“开灯”),语音助手调用ASR模块22、NLU模块23和DM模块24,识别用户语音指令所对应的意图,并将其映射为对应的技能(如开灯技能),语音助手根据技能映射结果,通过对应的技能服务接口,将技能执行请求发送至相应的业务逻辑处理系统(如控制平台),业务逻辑处理系统按技能执行请求,控制对应的设备/平台(如电灯)执行对应服务(如开启电灯操作),电子设备向用户提供服务反馈(如语音播报“灯已打开”)。在一种可实现方式中,语音助手也可以直接控制灯的开关,可以不经过业务逻辑处理系统(如:控制平台)。本发明实施例对此不做具体限定。
在一些实施例中,语音助手的功能模块可以全部部署在电子设备上。该类电子设备可以包括智能机器人,也可以包括手机、车机、大屏等功能丰富的富设备。
在一些实施例中,语音助手的功能模块可以一部分部署在电子设备,一部分部署在服务器或其他设备上,例如前端处理模块21可部署于电子设备上。ASR模块可以部署在电子设备上,也可以ASR模块的一部分部署在电子设备上,一部分部署在服务器或其他设备上。NLU和DM模块的部署也可以类似前述的ASR模块的部署,本发明实施例对此不做具体限定。为了有更丰富的功能和服务,上述提及的手机、车机、大屏等电子设备也可以属于采用该类架构部署,一些其他瘦设备也可以采用该类架构部署。
在另一些示例中,语音助手可跨多个电子设备分布,协同来实现语音交互功能。
可以理解,这里所列举的产品形态进行示例性说明,并不应对本申请构成任何限定。
应当指出,语音助手可具有比图示更多或更少的部件、可组合两个或更多个部件、或可具有部件的不同配置或布局。图2中所示的各种功能模块可在硬件、用于由一个或多个处理器执行的软件指令、包括一个或多个信号处理集成电路和/或专用集成电路的固件、或它们的组合中实现。
以下实施例中所涉及的技术方案均可以在上述电子设备中实现。以下结合附图和应用场景对本实施例提供的语音交互方法进行详细介绍。需要说明的是,以下实施例中,电子设备以手机为例,语音交互功能的实现以手机中的语音助手APP为例。
请参阅图3,图3为本申请实施例提供的一种语音交互方法的流程示意图。该语音交互方法包括如下步骤:
步骤S301:用户启动手机上的语音助手。
在本申请实施例中,用户希望通过语音与手机进行交互时,用户可以先触发手机中的语音交互功能,如用户启动手机中的语音助手,使语音助手处于工作状态。
在一些示例中,步骤S301可以省略,如语音交互功能(如语音助手)可以免启动,用户可直接与语音助手进行语音交互。
启动语音助手可以包括但不限于如下几种方式:
方式一:用户可以通过语音启动(唤醒)语音助手。其中,唤醒语音助手的语音数据可以称为唤醒词(或唤醒语音)。该唤醒词可以预先注册在手机中,例如华为语音助手小艺的唤醒词为“小艺,小艺”。示例性的,搭载语音助手小艺的手机可将麦克风设置为常开状态(always on),进而,手机可通过麦克风实时检测用户输入的语音信号。当检测到用户输入唤醒词“小艺,小艺”的语音信号后,手机可唤醒安装于手机中的语音助手小艺,使语音助手小艺接收用户的语音指令。语音助手小艺被唤醒后,可应答用户输入的唤醒词“小艺,小艺”,输出“小艺在”的应答,并开始接收用户输入的语音指令。如图4所示,手机上可显示语音助手的对话界面501,对话界面501中可以实时显示用户与语音助手小艺的对话内容。
方式二:用户可以通过触控方式启动语音助手,例如通过长按home键、点击电源按键或手机界面上语音助手的应用图标,启动语音助手。
可以理解,对于搭载语音助手的车载设备,用户可以点击车载语音按键,对于搭载语音助手的智慧屏,用户可以点击遥控器按键。本申请实施例对此不作具体限定。
步骤S302:用户向手机发送第一语音指令。
在本申请实施例中,用户与手机交互的过程中,用户向手机发送语音指令1。
在一种示例中,若手机上的语音助手能准确识别语音指令1的语义,则手机执行对应的操作,和/或,控制其他设备执行对应的操作。
在另一种示例中,若语音助手不能识别语音指令1的语义,如不能识别语音指令1对应 的用户意图,则手机无法执行相应的操作。在一些示例中,语音助手不能识别语音指令1的语义时,可以给用户一个提醒,如手机语音提示用户“我不知道你说的是什么意思”。
在另一种示例中,若语音助手不能识别语音指令1的语义,如将语音指令1对应的真实用户意图A识别为意图B,则语音助手根据该意图B输出执行指令C,手机响应该执行指令C执行操作。用户可根据手机执行的操作了解到语音助手错误地识别语音指令1的语义。
其中,第一语音指令可以为用户在语音交互过程中任一次输入的语音指令。
用户与手机进行交互的语音指令很多样,如下示例几种可能语音助手不能有效识别语音指令1语义的情形:
情形一:用户语音指令的内容比较口语化或个性化。日常交互场景下,用户可能不会采用书面化句式或标准语音指令,例如“播放音乐”,用户输入的语音指令可能如“飙个歌吧”或“Music走起”等。
情形二:用户语音指令的关键词/对象表达不清晰。在特定时间点或事件背景下,用户可能不会在语音指令中采用完备或标准化的关键词描述。例如语音助手能识别语音指令“我要看《哈利波特2》”,而用户可能会偏向于采用简略或通俗化的关键词描述,例如“我要看哈2”。
情形三:用户语音指令模糊或歧义。用户发送的语音指令可能存在表意不清晰的问题,例如“我要看演员A做厨师的电影”,实际上标准语音指令为“我要看《电影B》(演员A主演)”。或,用户发送的语音指令带有方言或口音,例如“打电伐给老汉”(四川方言),而实际上语音助手可识别的标准语音指令为“打电话给爸爸”(普通话)。
上述情形只是示例,语音助手不能有效识别语音指令1语义的情形并不限于上述示例情形。在上述情形中,若语音助手的标准语音指令集(可有效识别的语音指令集)未覆盖支持类似上述示例中的非标准句式、非标准关键词或模糊歧义的语音指令,则语音助手可能会不能识别用户发送的语音指令1的语义,无法识别用户所发送的语音指令1对应的用户意图。也即语音助手能识别的语音指令集未覆盖到上述情形中的语音指令,换句话说,上述情形中的语音指令为语音助手不能有效识别的语音指令。语音助手可有效识别的语音指令集未覆盖支持类似上述示例中的非标准句式、非标准关键词或模糊歧义的语音指令,可能是语音助手的ASR模块和/或NLU模块不可有效识别上述非标准语音指令。
可以理解,循环神经网络(Recurrent Neural Network,RNN)、长短期记忆网络(Long Short Term Memory,LSTM)、Transformer等深度学习模型,可应用于图3所示语音助手的ASR模块和NLU模块中。在构建ASR模块和NLU模块的训练数据集时,通常使用预先设置的标准语音指令,如“播放音乐”、“请开灯”等。即语音助手的ASR模块和NLU模块有其内部的语音识别算法或模型经过训练后所能支持的语音指令集,语音助手可有效识别语音指令集中的指令,这些可被称之为标准语音指令,也即标准语音指令是能被语音助手直接有效识别其对应用户意图的语音指令。
若用户发送的第一语音指令属于上述三种情形中的一种或多种,或者说,用户发送的第一语音指令不同于语音助手能识别的标准语音指令,即第一语音指令为不同于标准语音指令的非标准语音指令,则语音助手可能会不能有效识别第一语音指令的语义。
需要说明的是,第一语音指令不限于上述三种情形,本申请实施例对此不作具体限定。
如图3所示,下述实施例以第一语音指令为“飚个歌吧”为例。在一种示例中,如图5(a)所示,语音助手接收到该第一语音指令后,可以在对话界面601上显示第一语音指令对应的文本内容。
在其中一种可能实现方式中,步骤S301和步骤S302可以合并为一个步骤,用户输入的语音信号,可以是以语音唤醒词开头的语音信号。例如,“小艺,小艺,飚个歌吧”、“小艺,小艺,请共享会议室的屏幕”、“小艺,小艺,我要结束会议”。则唤醒词后面的语音信号为用户输入的语音指令,如“飚个歌吧”、“请共享会议室的屏幕”以及“我要结束会议”为用户向语音助手发送的语音指令。语音助手检测到唤醒词后,接收语音指令,手机上显示语音助手的对话界面,如图5(b)所示,对话界面602上显示用户输入的语音信号“小艺,小艺,飚个歌吧”的文本内容。
步骤S303:手机识别第一语音指令对应的第一用户意图。
在一种示例中,手机可以使用语音交互功能来识别第一语音指令对应的第一用户意图,如识别第一语音指令对应的第一用户意图可以由语音助手完成。手机上的麦克风将采集到的用户语音信号(第一语言指令)转发给语音助手的前端处理模块21,由前端处理模块21对该语音信号进行预处理,得到预处理的语音信号,将预处理的语音信号输入至ASR模块22。ASR模块22将预处理的语音信号转化为对应的文本,得到第一文本。可选的,第一文本也可以是语音助手将转化的文本进行文本处理,如文本归一、纠错、书面化处理等,后得到的文本。将第一文本输入NLU模块23。NLU模块23识别第一文本的语义,对第一文本分词、词性标注、关键词提取等处理操作,提取出第一语音指令对应的第一用户意图。其具体实现方式可以参考图2,在此不再赘述。
在另一种示例中,若用户所发送的第一语音指令属于非标准语音指令,语音助手可能无法识别第一语音指令对应的第一用户意图,如第一语音指令“飚个歌吧”对应的第一用户意图为“播放音乐”,但是语音助手可能识别不出来。或者,语音助手不能正确识别到第一语音指令对应的第一用户意图为“播放音乐”。又如第一语音指令“我要看哈2”对应的第一用户意图为“打开《哈利波特2》”,但是语音助手可能识别为其他视频。
步骤S304:在手机无法识别第一语音指令对应的第一用户意图时,手机输出无法识别反馈。在本申请实施例中,语音助手的ASR模块22对第一语音指令的识别失败,或NLU模块23对第一语音指令的识别失败,则语音助手不能识别出第一语音指令的语义,语音助手无法理解第一语音指令对应的第一用户意图。语音助手通过手机向用户输出无法识别反馈,以向用户表达语音助手不能理解或不能识别第一语音指令所对应的第一用户意图的事实。
在其中一种可能实现方式中,无法识别反馈可以通过文本形式显示在对话界面上,如图6(a)所示,手机在对话界面701上显示无法识别反馈“我不清楚你的意思”的文本内容。
在其中一种可能实现方式中,手机可以通过语音形式向用户输出无法识别反馈,如向用户输出“我不清楚你的意思”的语音。
在本申请实施例中,该无法识别反馈还可以为“我听不懂你的意思”或“小艺无法理解你的意思”等,本申请实施例对此不作具体限定。
在其中一种可能实现方式中,步骤S304可以为:在手机无法识别第一语音指令对应的第一用户意图时,手机输出用于引导用户继续进行后续交互的引导信息。
在本申请实施例中,在语音助手无法识别第一语音指令对应的语义时,语音助手还可以输出用于引导用户继续进行后续交互的引导信息,并由手机向用户输出该引导信息。
其中,该引导信息可以为请求用户继续进行交互的信息,如“请再说一遍”、“请您慢点说”或“请您用普通话再说一遍”,也可以为对用户发送的语音指令的疑问,如“您刚刚是什么意思”或“您刚刚说了什么”。本申请实施例对此不作具体限定。
在其中一种可能实现方式中,该引导信息也可以通过文本形式显示在对话界面上。示例性地,如图6(b)所示,手机在对话界面702上显示引导信息“请你再说一遍。”的文本内容。
在其中一种可能实现方式中,手机可以通过语音形式向用户输出引导信息,如手机向用户输出“请再说一遍”的语音。
在其中一种可能实现方式中,步骤S304可以为:在手机无法识别第一语音指令对应的第一用户意图时,手机输出无法识别反馈和用于引导用户继续进行后续交互的引导信息。
其中,无法识别反馈和引导信息可以采用上述文本形式或语音形式输出。如可以同时以文本形式输出无法识别反馈和引导信息的内容,或同时以语音形式输出无法识别反馈和引导信息的内容,或,无法识别反馈和引导信息各自采用不同形式输出,本申请实施例对此不作具体限定。
示例性地,如图6(c)所示,手机在对话界面703上显示无法识别反馈和引导信息的文本内容“我不清楚你的意思,请你再说一遍。”。
可以理解,手机还可以通过其他形式输出无法识别反馈或引导信息,如手机振动或手机以不同振动频率振动、指示灯亮灯。不同的电子设备也可以有不同的指示形式,如智能音箱可以控制LED灯亮,或灯亮的频率来指示。本申请实施例对此不作具体限定。
在其中一种可能实现方式中,步骤S304可以为:手机根据识别到的第一语音指令的用户意图执行操作。
在本申请实施例中,步骤S303手机识别到的第一语音指令对应的第一用户意图为错误的,手机不能正确识别第一语音指令对应的第一用户意图,将第一语音指令对应的第一用户意图识别为其他错误意图,则步骤S304手机根据错误意图执行错误操作。
示例性地,用户输入第一语音指令“飚个歌吧”,语音助手将该第一语音指令的用户意图识别为“开灯”。如图6(d)所示,在对话界面704上显示语音助手执行操作后的反馈,如“好的,已开灯”。用户可以通过手机执行操作后的反馈了解到语音助手对其真实意图识别错误。
步骤S305:用户向手机发送第二语音指令。
现有技术中的语音助手无法识别用户输出的第一语音指令时,就会结束对第一语音指令的处理,终止交互流程。如用户通过语音指令与手机进行交互时,若手机的语音交互功能(语音助手)无法识别用户的语音指令,则会结束对话。现有技术中的语音助手对第一语音指令对应用户意图识别错误时,用户无法修正语音助手对第一语音指令的真实用户意图识别。而执行本申请实施例提供的语音交互方法的语音助手可以提供解释功能:在语音助手无法识别或无法正确识别用户输出的第一语音指令对应的第一用户意图时,语音助手继续使用拾音设备(如麦克风)采集用户的语音,以接收用户输入的用于解释或复述第一语音指令的第二语音指令。可以理解为,执行本申请实施例提供的语音交互方法的语音助手提供一个解释流程,用户可以继续使用语音指令与语音助手进行语音交互,以通过另一语言表达内容向语音助手再次表达上一条第一语音指令的内容,即用户换一种说法向语音助手表达或解释上一第一语音指令的语义,以便语音助手能理解第一语音指令对应的用户意图。则语音助手可以根据用户输入的语音指令得知第一语音指令对应的用户意图,或修正对第一语音指令对应的用户意图的识别结果。
在本申请实施例中,语音助手无法识别用户输出的第一语音指令时,用户可以发送第二 语音指令,通过第二语音指令来解释第一语音指令,从而使语音助手可以有效执行第一语音指令对应的响应,丰富用户与电子设备的语音交互。如图3所示,用户输入第二语音指令“播放音乐”,其中第二语音指令“播放音乐”为语音助手可以有效识别的语音指令。
在本申请实施例中,在语音助手无法识别第一语音指令时,即语音助手无法生成第一语音指令的识别结果(如用户意图),语音助手可以建立与所述第一语音指令相关联的学习会话,以在该学习会话期间可以继续接收用户输入的第二语音指令。
在其中一种可能实现方式中,在语音助手生成第一语音指令的识别结果后,语音助手还可以建立与第一语音指令相关联的学习会话,以在该学习会话期间可以继续接收用户输入的语音指令。
在其中一种可能实现方式中,在语音助手生成第一语音指令的识别结果后,语音助手可以继续接收用户输入的第二语音指令,检测到该第二语音指令为用于解释第一语音指令时,如该第二语音指令为使用下述第二种方式或第三种方式发送的语音指令时,则语音助手可以选择建立与第一语音指令相关联的学习会话,或不建立与第一语音指令相关联的学习会话。
在本申请实施例中,用户发送第二语音指令的方式可以包括如下:
第一种方式,用户直接向手机发送第二语音指令。第一种方式是在语音助手或手机无法识别第一语音指令对应的第一用户意图的场景下使用。即在语音助手无法识别第一语音指令的语义时,手机接收到第二语音指令用于解释或复述第一语音指令。语音助手默认再接收到的用户发送的第二语音指令是用于解释或复述第一语音指令;或者,语音助手在无法识别第一语音指令时,在预设时间内接收到的第二语音指令为用于解释或复述第一语音指令;等。
具体地,在步骤S303中语音助手无法识别第一语音指令对应的第一用户意图,在步骤S304中语音助手通过手机输出无法识别反馈或引导信息或无法识别反馈和引导信息后,用户了解到语音助手无法识别第一语音指令对应的用户意图,用户继续向手机发送第二语音指令如“播放音乐”。手机将接收到的第二语音指令转发至语音助手,语音助手默认该第二语音指令“播放音乐”是对上一无法识别的第一语音指令“飚个歌吧”的复述,或为对上一无法识别的第一语音指令“飚个歌吧”的解释。
第二种方式,用户可以采用预设模板向手机发送第二语音指令,即用户输入的语音包括预设模板和第二语言指令。该预设模板用于表征当前采用预设模板发送的第二语音指令,是用于对上一第一语音指令的复述或解释。当语音助手检测到该预设模板时,语音助手认为采用该预设模板所发送的第二语音指令即为用于对上一条第一语音指令进行解释或复述的语音指令。预设模板可以是包含解释性内容的指令形式,本发明实施例对预设模板的形式不做具体限定。
示例性地,该预设模板可以为固定句式,如“我的意思是”、“我上一句的意思是”或“我刚才说的是”等。即,在用户使用第二种方式向手机发送第二语音指令时,所输入的语音为“我的意思是播放音乐”、“我上一句的意思是播放音乐”或“我刚才说的是播放音乐”等。该预设模板还可以为一个预设词语,如“解释”、“复述”或“修正”等。即,在用户向手机发送第二语音指令时,所输入的语音为“解释,播放音乐”、“复述,播放音乐”或“修正,播放音乐”等。
第三种方式,用户可以进行触发操作后或进行触发操作时向手机发送第二语音指令。该触发操作所对应的触发指令用于表征手机接收到的第二语音指令是用于对上一条第一语音指令的复述或解释。示例性地,触发操作可以为对UI虚拟按键的触发操作,语音助手的对话界 面上可以呈现UI虚拟按键,如语音助手可在无法有效识别第一语音指令时显示改UI虚拟按键,也可以一直显示该UI虚拟按键。该UI虚拟按键触发后,用户可以继续向手机输入语音指令用于复述或解释上一条语音指令。以UI虚拟按键为例,用户点击手机上所显示的UI虚拟按键,并同时输入语音指令“播放音乐”,语音助手检测到用户点击对话界面上UI虚拟按键的触发操作,语音助手接收到该触发操作对应的触发指令后,将手机接收到的语音指令“播放音乐”作为对第一语音指令“飚个歌吧”的复述,或作为对第一语音指令“飚个歌吧”的解释。
可以理解,触发操作还可以为对物理按键的触发操作,物理按键可以为手机上的home键、电源按键、车载语音按键或智慧屏的遥控器按键。触发操作还可以为预设的手势等。本申请实施例对此不作具体限定。
需要说明的是,在第三种方式中,可以通过对语音用户界面(voice user interface,VUI)的开发,为用户提供更多可用于进行触发操作的选项,本申请实施例对此不作具体限定。
在上述第一种方式中,语音助手在无法识别第一语音指令时,语音助手默认在第一语音指令之后接收到的语音指令为用于对第一语音指令进行解释或复述的。用户可以直接与语音助手进行解释,无须像第二种方式中需要采用预设模板发送第二语音指令,也无须像第三种方式中需要进行触发操作,节省用户操作流程,用户与手机的交互更智能化更人性化。
上述第二种方式可以在语音助手无法识别第一语音指令对应的第一用户意图时使用,即可以在语音助手不能识别输出第一语音指令对应的第一用户意图,或,语音助手无法正确识别第一语音助手对应的第一用户意图使用。示例地,用户了解到语音助手无法识别输出第一语音指令对应的第一用户意图,手机输出无法识别反馈后。用户采用预设模板向手机发送第二语音指令,如向手机输入语音“我的意思是播放音乐”。语音助手检测到该语音输入中包括预设模板“我的意思是”,则将第二语音指令“播放音乐”作为对第一语音指令“飚个歌吧”的复述,或作为对第一语音指令“飚个歌吧”的解释。或,用户了解到语音助手对第一语音指令对应的第一用户意图识别错误,如图6(d),语音助手将第一语音指令“飚个歌吧”的用户意图“播放音乐”识别为“开灯”,用户采用预设模板向手机手发送第二语音指令,如向手机输入语音“我的意思是播放音乐”。语音助手检测到该语音输入中包括预设模板“我的意思是”,语音助手识别到当前输入的语音是用于对上一第一语音指令“飙个歌吧”的解释,将第二语音指令“播放音乐”作为对第一语音指令“飚个歌吧”的解释。则语音助手可以将“飚个歌吧”的用户意图由“开灯”修正为“飚个歌吧”。
第二种方式相对于第一种方式,用户可以根据自己的需求对第一语音指令进行解释或复述,不受语音助手引导流程限制。即在第二种方式中,不限定语音助手在无法识别第一语音指令时,默认在第一语音指令之后接收到的语音指令为用于对第一语音指令进行解释或复述的。用户可以在第一语音指令之后输入与第一语音指令无关的语音指令,而语音助手也不会把该无关的语音指令作为对第一语音指令的解释,仅在语音助手检测到用户采用了预设模板,才认为采用预设模板的语音指令为对上一条未识别的语音指令的复述或解释。且第二种方式还可以在了解到语音助手识别错误时候对语音助手的识别进行修正。第二种方式相对于第三种方式,第二种方式通过语音交互即可以完成对第一语音指令的解释或复述,提高用户的使用体验。
上述第三种方式与第二种方式相似,可以在语音助手不能识别输出第一语音指令对应的第一用户意图,或,语音助手无法正确识别第一语音助手对应的第一用户意图时使用,具体 原理可以参照上述描述,在此不再赘述。在上述第三种方式与第二种方式的区别在于,可以为用户提供额外的交互体验,用户可以通过对物理按键、UI虚拟按键或使用预设手势进行触发操作,进而可以在触发操作时或触发操作之后发送第二语音指令,语音助手或手机在检测到触发操作对应的触发指令后,将第二语音指令作为对上一条第一语音指令的复述或解释。
需要说明的是,本申请实施例不限于在单轮语音交互还是多轮语音交互。在单轮交互场景下,语音助手使用本申请实施例的语音交互方法,可以在无法识别第一语音指令对应的用户意图时,语音助手继续使用拾音设备(如麦克风)采集用户的语音,在接收到用户输入的语音指令,或接收到用户采用预设模板发送的语音指令或监测到触发指令时,将单轮交互转化为多轮交互。
在其中一种可能实现方式中,若用户在语音助手对第一语音指令识别失败后,继续输出语音指令2,该语音指令2可能并非是标准语音指令,语音助手仍无法识别该语音指令2或无法正确识别该语音指令2,可设置预设的交互次数或时长阈值,若在预设的交互次数或时长阈值内,语音助手均无法识别用户继续输入的语音指令2,则语音助手将交互过程所涉及的数据(包括但不限于语音指令数据、程序日志等),上传至云服务器,以由人工进行识别,由人工将语音指令2与第一语音指令关联。
在其中一种可能实现方式中,在预设的时间范围内,如接收到用户输入的语音指令,或接收到用户采用预设模板发送的语音指令或监测到触发指令,则继续下述步骤,若超出该预设的时间范围,未接收到用户输入的语音指令,或未接收到用户采用预设模板发送的语音指令或未监测到触发指令,则结束交互流程。
步骤S306:手机识别第二语音指令,并执行所述第二语音指令对应的第一操作。
手机语音助手可识别第二语音指令对应的第二用户意图,并根据所述第二用户意图执行与所述第二意图对应的第一操作。
在本申请实施例中,手机可以使用语音交互功能来识别第二语音指令对应的第二用户意图,如语音助手识别第二语音指令对应的第二用户意图。具体流程可以参考步骤S303,在此不再赘述。
在本申请实施例中,S303在使用预设模板发送第二语音指令时,手机接收到的语音输入包括了第二语音指令如“播放音乐”以及预设模板如“我的意思是”,语音助手识别出该语音输入中预设模板“我的意思是”,将语音输入除预设模板“我的意思是”外的内容识别为第二语音指令,得到第二语音指令“播放音乐”,识别第二语音指令“播放音乐”对应的第二用户意图的具体实现方式可以参考步骤S303,在此不再赘述。
以步骤S304中手机输出无法识别反馈与引导信息“我不清楚你的意思,请你再说一遍”,为例。
在用户采用第一种方式发送第二语音指令时,用户直接向手机输入第二语音指令“播放音乐”,语音助手接收到第二语音指令“播放音乐”后,如图7(a)所示,在对话界面801上显示第二语音指令的文本内容。
在用户采用第二种方式发送第二语音指令时,语音助手接收到的语音输入为“我的意思是播放音乐”后,如图7(b)所示,在对话界面802上显示该语音的文本内容。
在用户采用第三种方式发送第二语音指令时,如图7(c)所示,手机上呈现的对话界面803上显示有UI虚拟按键804,用户点击UI虚拟按键804,并向手机输入第二语音指令“播放音乐”。手机在对话界面803上显示第二语音指令的文本内容。
在本申请实施例中,手机识别到第二语音指令“播放音乐”对应的第二用户意图为“播放音乐”(示例如Intent:Play Music)。
示例的,若第二指令为“播放音乐”,手机执行的第一操作为音乐播放的操作,如打开音乐播放App或音乐播放服务,向用户播放歌曲。
在一种示例中,如果用户的语音指令中没有歌曲名称实体,手机可以根据预设推荐规则,确定出推荐歌曲,然后播放给用户。例如,手机可以根据用户的历史播放记录,将最近7天内播放最多的歌曲作为推荐歌曲。手机响应于播放音乐的执行指令,自动播放确定出的推荐歌曲,并显示于对话界面。示例性地,在如图7(c)所示的交互基础上,语音助手识别出第二语音指令的第二用户意图,手机上播放歌曲,如图8所示,手机输出反馈,如播报“好的,开始播放音乐”,并在对话界面901上显示针对用户语音指令的应答语句文本内容以及音乐控件902。此时,音乐控件902内显示正在播放的歌曲为手机正在播放的歌曲。
步骤S307:手机建立第一语音指令与第二语音指令的关联关系。
在本申请实施例中,可以将第一语音指令与第二语音指令的关联关系存储在手机本地,也可以存储至云服务器,本申请实施例对此不作具体限定。对于关联关系的形式,本申请实施例也不做具体限定。
其中,步骤S307可以由手机上的语音助手来完成。
在本申请实施例中,语音助手检测到用于复述或解释未被识别的第一语音指令的第二语音指令,建立第一语音指令与第二语音指令的关联关系。或者说,语音助手在检测到用户是使用上述三种方式中的任一方式发送第二语音指令时,建立第一语音指令与第二语音指令的关联关系。
其中,步骤S307在用户执行步骤S305后执行。语音助手识别到第二语音指令为对第一语音指令进行解释或复述的语音指令时或之后,语音助手可在任意一个时间执行步骤S307。即,步骤S307可以在步骤S306之前或之后或同时执行,步骤S307还可以在当前语音交互流程结束后执行,即允许语音助手离线执行步骤S307。比如说,在语音助手退出运行,或手机关机充电的时候语音助手可以执行步骤S307。
在本申请实施例中,语音助手检测到用于复述或解释未被识别第一语音指令的第二语音指令,可以包括如下方式:
1)语音助手默认在语音助手无法识别第一语音指令对应的第一用户意图时,语音助手再接收到的语音指令即为用于复述或解释未被识别第一语音指令的第二语音指令。即,默认用户在手机输出无法识别反馈、引导信息或无法识别反馈与引导信息后,用户再向手机所输出的语音指令为用于复述或解释未被识别第一语音指令的第二语音指令。
2)在检测到用户输入的语音输入中包括预设模板,则该语音输入中包括用于复述或解释未被识别第一语音指令的第二语音指令。即语音助手在语音交互过程中,检测用户输入的语音输入是否有预设模板,若有,则语音助手检测该语音输入中是否包括除了预设模板之外的语音指令,若有,则该语音指令为包括用于复述或解释未被识别第一语音指令的第二语音指令。
3)在检测到用户进行触发操作之后的触发指令时或后,手机再接收到的语音指令为第二语音指令。即临近产生触发指令时刻(如产生触发指令时或之后),语音助手再接收到的语音指令作为第二语音指令。
在本申请实施例中,建立第一语音指令与第二语音指令的关联关系,可以理解为将第一 语音指令映射为第二语音指令,即认为第一语音指令对应的第一用户意图与第二语音指令对应的第二用户意图相似或一致。在识别第一语音指令对应的第一用户意图时,可以将第二语音指令对应的第二用户意图作为第一语音指令对应的第一用户意图。
在其中一种可能实现方式中,建立第一语音指令与第二语音指令的关联关系,可以为将第一语音指令等同于第二语音指令。在将第一语音指令等同于第二语音指令后,后续再接收到与第一语音指令的定义或发音相同的第三语音指令时,将第三语音指令替换为第二语音指令,对第二语音指令进行识别处理得到第二用户意图,将第二用户意图作为第三语音指令对应的用户意图输出。
在其中一种可能实现方式中,建立第一语音指令与第二语音指令的关联关系,可以将第二语音指令的第二用户意图与第一语音指令相关联,即将第二用户意图作为第一语音指令的用户意图。则在将第二语音指令的第二用户意图与第一语音指令相关联后,后续再接收到与第一语音指令的定义或发音相同的第三语音指令时,无须识别第三语音指令,直接获取第二用户意图并输出。
在本申请实施例中,在语音助手建立了第一语音指令与第二语音指令的关联关系后,可以得到关联表。语音助手可以根据该关联关系(或关联表)拓展其意图理解能力,使得语音助手由原来的不能识别第一语音指令对应的第一用户意图,变为可以识别第一语音指令对应的第一用户意图。即,当用户采用语音助手的ASR模块和/或NLU模块的标准语音指令集未覆盖的非标准句式或非标准关键词,或带有模糊歧义的语音指令,即采用非标准语音指令与手机进行交互时,若语音助手识别该非标准语音指令的用户意图失败,则使用执行本申请实施例的语音交互方法的语音助手或手机可以通过输出无法识别反馈或引导信息来引导用户对第一语音指令进行解释或复述,接收用户输入的用于复述或解释未被识别第一语音指令的第二语音指令,建立第一语音指令与第二语音指令的关联关系,语音助手根据该关联关系自学习或自更新其模型或关联表,拓展语音助手支持非标准句式或非标准关键词,或带有模糊歧义的语音指令的意图理解能力。换句话说,语音助手提供解释功能,用户利用该解释功能,通过标准语音指令与其进行交互,引导语音助手拓展意图理解能力,从而使得语音助手快速支持用户自定义的非标准句式/非标准关键词、和/或歧义指令的非标准语音指令,解决语音助手无法识别非标准语音指令,无法识别口语化、个性化语音指令的问题,丰富用户与电子设备的语音交互。
下文将详细描述如何根据第一语音指令与第二语音指令的关联关系,改善语音助手对第一语音指令的识别,以拓展语音助手的意图理解能力,支持上述非标准句式或非标准关键词,或带有模糊歧义的非标准语音指令。
步骤S308:用户向手机发送第三语音指令。
在语音助手建立第一语音指令与第二语音指令的关联关系后,拓展了语音助手的识别能力后,语音助手可以有效识别第一语音指令。用户可以在下一次语音交互时使用与第一语音指令语音内容相同或相似的第三语音指令如“飚个歌吧”与手机进行交互,如图3所示,用户第N次向手机发送第一语音指令“飚个歌吧”,其中,N为大于1的整数。
在一种示例中,用户向手机发送第三语音指令时,用户向手机输入语音“飚个歌吧”,无论ASR模块将该语音“飚个歌吧”识别为“飚个歌吧”还是“飚个个吧”,基于ASR模块内的纠错功能,也会将最终输出的文本纠正为“飚个歌吧”。又或者,用户向手机输入语音“飚各个吧”,ASR模块识别出该语音对应的文本为“飚各个吧”,基于ASR模块内的纠错功能, 也会将最终输出的文本纠正为“飚个歌吧”。则该第三语音指令是与第一语音指令的语音内容相似的语音指令。语音助手将第三语音指令认为是在语音识别结果上与第一语音指令强关联的语音指令,进而将第三语音指令与第一语音指令关联,由此可以认为第三语音指令对应的第三用户意图即第一语音指令对应的第一用户意图。
步骤S309:手机根据第三语音指令执行第一操作。
在本申请实施例中,语音助手在识别出第三语音指令是与第一语音指令在语音识别结果上认为是基于相同的语音指令,如识别出的第三语音指令与第一语音指令的内容或发音相同,基于第一语音指令与第二语音指令关联,手机执行与所述第二语音指令的响应相同的第一操作。具体内容可以参考步骤S306,在此不再赘述。
可以理解,NLU模块在识别语音指令对应的用户意图时,NLU模块被设计为有一定的鲁棒性或纠错能力,即使NLU模块接收到ASR模块传输的文本与标准文本有细微的差别,例如NLU模块接收到文本“播放一个音乐吧”,而非标准文本“播放一首音乐”,其也能够正确识别出对应用户意图为“播放音乐”。又如,NLU模块接收到文本“飚首歌吧”,而非标准文本“飚个歌吧”,其也能正确识别出对应用户意图为“播放音乐”。可以理解的是,此处示例的NLU鲁棒性或纠错能力范围内的不同语音指令,属于基本相同的语音指令。可以理解的是,内容或发音不完全相同的两个语音指令可以属于基本相同的语音指令。
在其中一种可能的实现方式中,语音交互方法还包括:接收第四语音指令,其中第四语音指令与第一语音指令的内容或发音不完全相同,第四语音指令与第一语音指令的内容或发音的相似度在第一范围内,响应于第四语音指令,执行与第二语音指令相同的第一响应。
第四语音指令可以为基于语音助手鲁棒性判定第四语音指令与第一语音指令为实质相同的语音指令,也可以为语音助手判定与第一语音指令相似度在第一范围内的语音指令。如基于语义助手鲁棒性,可以将第四语音指令“飚首歌吧”识别为与第一语音指令“飚个歌吧”是实质相同的语音指令。或,语音助手判定第四语音指令“飚首歌吧”与第一语音指令“飚个歌吧”相似的相似度为95%,第一范围为90%至99%。
在其中一种可能实现方式中,在接收第二语音指令之后,可以接收到第三语音指令或第四语音指令。在其中一种可能实现方式中,在接收到第二语音指令后,可以接收到第三语音指令和第四语音指令,如先接收到第三语音指令后再接收第四语音指令,或先接收第四语音指令再接收第三语音指令。
示例性地,以语音助手在检测到第二语音指令之后,就建立第一语音指令与第二语音指令的关联关系,拓展其对第一语音指令意图理解能力,用户在第一次使用第一语音指令进行交互的交互流程中继续以第三语音指令“飚个歌吧”与手机进行交互为例,手机在步骤S306执行第一操作“播放音乐”后,手机输出反馈。用户继续向手机发送第三语音指令“飚个歌吧”,如图9(a),手机在对话界面101上显示包括用户输出第三语音指令时对应的语音文本内容、针对用户语音的回答语句文本内容“好的,开始播放音乐”,以及音乐控件102。此时,音乐控件102内显示手机正在播放的歌曲。
用户发送第三语音指令不限于其第一次使用第一语音指令进行交互的交互流程中,也可以在该交互流程结束后。示例性地,在星期一,用户使用第一语音指令“飚个歌吧”与手机进行交互,语音助手或手机基于上述步骤已拓展了语音助手对第一语音指令意图理解能力。在星期二,如图9(b),用户向手机输入语音“小艺,小艺,飚个歌吧”,用户唤醒手机上的语音助手,语音助手识别第三语音指令“飚个歌吧”的语义,根据第一语音指令与第二语音 指令之间的关联关系,然后根据第三语音指令的用户意图执行第一操作。如图9(b),手机在对话界面103上显示对话历史、当前输入的第三语音指令的文本内容“飚个歌”,以及语音助手针对用户语音的回答语句文本内容“好的,开始播放音乐”,以及音乐控件104。此时,音乐控件104内显示手机正在播放的歌曲。
在其中一种可能实现方式中,在第一语音指令与第二语音指令关联之后,用户在下回使用第一语音指令与手机进行交互,不限于使用语音方式,可以通过文本形式也可以通过语音形式与手机进行交互。如,语音助手将“播放音乐”与“Music走起”关联,则下回用户操作手机,在语音助手的对话界面上向语音助手发送文本内容“Music走起”,语音助手均能识别该文本内容“Music走起”对应的用户意图为“播放音乐”。
在其中一种可能实现方式中,本申请实施例提供的语音交互方法可以在上述对话界面上实现,也可以在设置界面上实现。执行本申请实施例提供的语音交互方法的电子设备可以提供一设置界面,该设置界面可以供用户进行语音指令设置。用户可在该设置界面上进行语音指令关联设置。如用户向设置界面输入第一语音指令,后续再向设置界面输入第二语音指令,语音助手将第一语音指令与第二语音指令关联。
可以理解,在设置界面上输入第一语音指令或第二语音指令可以通过语音输入或文本输入,本申请对此不作具体限定。
可以理解,本申请实施例提供的语音交互方法,不限于语音助手未能识别语音指令场景。本申请实施例提供的语音交互方法还可以根据用户个人需求应用于各种场景,如:
场景一:将本申请实施例提供的语音交互方法应用于对语音助手个性化语音指令的设置。用户可以根据个人语言习惯或需求去调整语音助手对语音指令的语义识别。例如用户习惯使用“Music走起”,但语音助手不能识别出“Music走起”对应的用户意图为“播放音乐”,则用户可以主动使用第二语音指令“播放音乐”解释“Music走起”。手机执行上述步骤将“播放音乐”与“Music走起”关联。则下回用户再以指令“Music走起”(语音形式或文本形式)与手机进行交互的时候,语音助手即可以识别出“Music走起”对应的用户意图为“播放音乐”。
场景二:将本申请实施例提供的语音交互方法应用于特殊人群特殊语音指令的设置。特殊人群如外国人、老人或小孩等无法输出标准语音指令。例如小孩可能会把“播放音乐”说成“波放乐”,则在小孩向手机输出第一语音指令“波放乐”后,大人可以向手机发送第二语音指令“播放音乐”。语音助手可以将第一语音指令“波放乐”与第二语音指令“播放音乐”关联。在小孩再向手机发送语音指令“波放乐”,语音助手会根据第一语音指令“波放乐”与第二语音指令“播放音乐”的关联,得到第一语音指令“波放乐”对应的用户意图为第二语音指令“播放音乐”对应的用户意图“播放音乐”,手机播放音乐。
上述场景一和场景二可以对话过程中实现,也可以在特定的对话界面或设置界面上实现,本申请实施例对此不作具体限定。
在本申请实施例中,对语音助手意图理解能力拓展不涉及通过繁琐的UI界面操作新增自定义语义,也不涉及通过API(Application Programming Interface)设计/调用扩充语义接口,而是直接通过人机语音交互完成。用户使用执行本申请实施例提供的语音交互方法的语音助手,使用过程无技术门槛,用户交互体验更为自然。同时,在拓展语音助手意图理解能力的过程,不涉及人工运营,开发维护成本更低,迭代周期短。
可以理解,本申请实施例提供的语音交互方法,不限于单电子设备的语音交互,在第一 电子设备将第一语音指令与第二语音指令建立关联后,第一电子设备可以将关联关系同步给其他电子设备。
下面将描述如何根据第二语音指令(标准语音指令)与第一语音指令(非标准语音指令)之间的关联关系改善语音助手对非标准语音指令的识别。
在本申请实施例中,标准语音指令与非标准语音指令的关联关系可以为标准语音指令与非标准语音指令的等同关系。其中,标准语音指令与非标准语音指令之间的等同关系可以如下表1所示。
非标准语音指令 标准语音指令
飚个歌吧 播放音乐
Movie走起 播放电影
打开PYQ 打开朋友圈
表1
在其中一种可能实现方式中,标准语音指令与非标准语音指令的关联关系可以为标准语音指令对应的识别结果(如用户意图)与非标准语音指令的关联关系。其中标准语音指令对应的识别结果(如用户意图)与非标准语音指令的关联关系可以如下表2所示。
非标准语音指令 标准语音指令的用户意图
飚个歌吧 播放音乐
Movie走起 播放电影
打开PYQ 打开朋友圈
表2
方式1,语音助手根据标准语音指令与非标准语音指令之间的关联关系更新关联表,并根据更新后的关联表改善语音助手对非标准语音指令的识别。
在本申请实施例中,可以构建一个数据库,该数据库存储关联表。由语音助手维护该关联表,该关联表描述非标准语音指令(包含非标准句式/关键词、和/或歧义指令)与标准语音指令之间的关联关系,即非标准语音指令与标准指令之间等效映射关系。该关联表可以为上述表1。
如图10所示,在语音助手的ASR模块22内挂载关联表,由ASR模块22更新并使用该关联表。在图3步骤S307中,电子设备的语音助手建立第一语音指令与第二语音指令的关联关系后,语音助手根据第一语音指令与第二语音指令的关联关系更新非标准语音指令与标准语音指令的关联表,即在关联表中非标准语音指令内填入第一语音指令“飚个歌吧”,在关联表的标准语音指令处对应于第一语音指令“飚个歌吧”填入第二语音指令“播放音乐”,如表1,非标准语音指令“飚个歌吧”其映射为标准语音指令“播放音乐”。
在语音助手更新完关联表后,用户再次向电子设备发送第一语音指令“飚个歌吧”,电子设备接收到第一语音指令“飚个歌吧”后,语音助手的ASR模块22识别第一语音指令“飚个歌吧”,ASR模块22的语言模型输出第一语音指令文本“飚个歌吧”,并在语言模型处理阶段,语音模型可查阅关联表,根据关联表将第一语音指令“飚个歌吧”替换为关联的第二语音指令“播放音乐”,输出第二语音指令文本“播放音乐”给NLU模块23。NLU模块23对第二语音指令“播放音乐”进行处理。即在ASR模块22根据关联表将第一语音指令替换为第二语音指令之后,后续语音助手的NLU模块23和DM模块等处理流程则对第二语音指令“播放音乐”进行处理,而非对第一语音指令“飚个歌吧”处理。
可选的,上述关联表可以挂载在NLU模块23内,在NLU模块23对非标语音指令进行处理过程中,NLU模块23根据关联表将非标准语音指令替换为对应的标准语音指令。即将关联表从ASR模块22直接平移到NLU模块23前端,将即将输入到NLU模块23进行处理的非标准语音指令文本,替换为标准语音指令文本,具体实现方式同ASR模块22,在此不再赘述。
可选的,上述关联表也可以关联到DM模块24。例如,将关联表作为一个技能,或者置于一个特定的技能里,若语音助手的模块无法识别某一语音指令时,DM模块会调用该技能,确认该语音指令是否与其他语音指令关联,或者该语音指令是否已关联特定的操作,若是,DM模块再根据关联的语音指令对应的响应执行操作,或者直接执行匹配到的已关联的特定操作。
可选的,上述语音指令替换的处理流程,也可以在其它语音识别算法或流程中进行,本申请实施例对此不作具体限定。
以关联表挂载在ASR模块22为例,语音助手更新关联表可以通过以下方式:
第一方式:
请参阅图11,语音助手维护的技能清单中包括关联技能。关联技能用于将当前处理的语音指令与上一条语音指令进行关联。具体地,在上述图3的步骤S305中,用户使用第二种方式发送第二语音指令,即用户向电子设备输入语音“我的意思是播放音乐”,ASR模块22将该语音输入转换为文本“我的意思是播放音乐”,并将其输入到NLU模块23。NLU模块23从该语音对应的文本中提取用户意图,从文本“播放音乐”提取出用户意图1(即“播放音乐”),从文本“我的意思是”提取出用户意图2(即对第一语音指令进行解释),并将所提取的意图数据传递给DM模块24。DM模块24根据用户意图1和用户意图2,分别调用其所对应的技能。即根据用户意图1调用音乐播放控制技能,根据用户意图2调用关联技能。音乐播放控制技能和关联技能分别通过调用所对应的服务接口,执行对应的操作。如音乐播放控制技能调用播控服务,语音助手输出执行指令给电子设备,电子设备根据该执行指令播放音乐。关联技能调用关联服务,语音助手记录第一语音指令“飚个歌吧”和第二语音指令“播放音乐”,在关联表中非标准语音指令内填入第一语音指令“飚个歌吧”,在关联表的标准语音指令处对应于第一语音指令“飚个歌吧”填入第二语音指令“播放音乐”,由此实现标准语音指令(第二语音指令)与非标准语音指令(第一语音指令)之间的关联。
在本申请实施例中,在该关联表为上述表2时,则该关联表可以挂载在NLU模块23,在NLU模块23工作时,从关联表中找到该第一语音指令“飚个歌吧”的用户意图为“播放音乐”,则NLU模块23输出第一语音指令“飚个歌吧”的用户意图为“播放音乐”。
第二方式:
请参阅图12,ASR模块22包括一检测模块。检测模块用于检测用户输入的语音是否包括预设模板,若包括预设模板,从该语音输入中提取出用于解释或复述未识别语音指令的标准语音指令,并将其传输给NLU模块23。检测模块还用于将提取出的标准语音指令与非标准语音指令进行关联。
具体地,在上述图3的步骤S305中,用户使用第二种方式发送第二语音指令,即用户向电子设备输入语音“我的意思是播放音乐”,ASR模块22可正确将该语音输入转换为文本“我的意思是播放音乐”。语言模型将文本“我的意思是播放音乐”输入到检测模块。检测模块通过检测或匹配预设模板“我的意思是”,以识别当前语音输入是否包括预设模板“我的意思是”。 若检测到该语音输入中存在预设模板,则检测模块可以确定当前语音输入涉及复述或解释第一语音指令“飚个歌吧”,则检测模块从文本“我的意思是播放音乐”中提取用于复述或解释第一语音指令的第二语音指令“播放音乐”,即语音输入文本中预设模板“我的意思是”后面的文本。检测模块将所提取的第二语音指令传递给后续NLU模块23和DM模块24,以识别第二语音指令对应的用户意图“播放音乐”,调用“音乐播放控制技能”,音乐播放控制技能调用播控服务,语音助手输出执行指令给电子设备,电子设备根据该执行指令播放音乐。检测模块更新关联表,在关联表中非标准语音指令内填入第一语音指令“飚个歌吧”,在关联表的标准语音指令处对应于第一语音指令“飚个歌吧”填入第二语音指令“播放音乐”,由此实现标准语音指令(第二语音指令)与非标准语音指令(第一语音指令)之间的关联。
需要说明的是,检测模块可以作为ASR模块22的一部分,也可以部署在ASR模块22外部,也可以置于ASR模块22之后,不限定检测模块的部署位置和形式。
可选地,在用户使用第一种方式发送第二语音指令时,语音助手也可以默认第一语音指令“飚个歌吧”和第二语音指令“播放音乐”关联,在关联表中非标准语音指令内填入第一语音指令“飚个歌吧”,在关联表的标准语音指令处对应于第一语音指令“飚个歌吧”填入第二语音指令“播放音乐”。
可选地,在用户使用第三种方式发送第二语音指令时,电子设备接收到触发指令,将该触发指令发送给语音助手,语音助手根据该触发指令关联第一语音指令“飚个歌吧”和第二语音指令“播放音乐”,在关联表中非标准语音指令内填入第一语音指令“飚个歌吧”,在关联表的标准语音指令处对应于第一语音指令“飚个歌吧”填入第二语音指令“播放音乐”。
方式2,语音助手根据标准语音指令与非标准语音指令之间的关联关系构建训练数据,根据训练数据训练语音助手。
在本申请实施例中,可以构建一个数据库,该数据库存储关联表,该关联表可以为上述表2,根据该关联表编制训练数据,对语音助手的ASR模块22和/或NLU模块23进行训练,如增量学习训练,使得训练后的ASR模块22和NLU模块23能够支持识别非标准语音指令。即将非标准语音指令添加到训练样本中,重新训练ASR模块22和/或NLU模块23对应的网络模型。
可以理解,方式2并不要求积累一定数量的数据再进行训练,只需要根据标准语音指令与非标准语音指令的关联关系,提取非标准语音指令关联的标准语音指令所对应的用户意图,构建训练数据包括非标语音指令与其对应的用户意图,并将其新增到训练数据中,就可以完成上述增量学习。
请参阅图13,根据用户自定义的非标准语音指令“飚个歌吧”和标准语音指令“播放音乐”之间的关联关系,构造包含用户自定义的非标准语音指令(采用非标准句式、非标准关键词或模糊歧义的语音指令)及其对应意图、词槽、垂域等信息的训练数据集。即根据关联表构建训练集,例如原来有1000条语音指令-意图的训练数据用于训练NLU模块23,该1000条训练数据未覆盖非标准语音指令“飚个歌吧”,根据非标准语音指令“飙个歌吧”对应的标准语音指令为“播放音乐”,得到其对应的用户意图为“播放音乐”。现在新增一条训练数据“指令:飙个歌吧,意图:播放音乐,词槽:null”。使用新增的训练数据对NLU模块23进行重训练或增量训练,在训练完成后,NLU模块23就能够支持识别非标准语音指令“飚个歌吧”。由此可拓展语音助手意图理解能力。
在一种可能实现方式中,可进一步采用少样本学习、增量学习等端侧学习技术,挖掘用 户语言习惯,更深入地增量训练语音助手ASR模块22和NLU模块23中的一种或多种,进一步提升语音助手意图理解能力和交互自然性。
以基于增量学习和生成式对抗网络(Generative Adversarial Networks,GAN)训练更新NLU模块23为例。
若语音助手记录的用户自定义的非标准语音指令数量较少,不足以支撑NLU模块23的增量学习,可采用GAN网络批量生成与已记录用户自定义的非标准语音指令风格相同的用户自定义非标准语音指令。
请参阅图14,可首先使用少量标注数据(包括已记录的用户自定义的非标准语音指令及其对应的等效标准语音指令)调优生成网络和分类网络,使其挖掘学习用户语言习惯。然后,将标准语音指令输入至生成网络,批量生成对应的用户自定义的非标准语音指令,得到标准语音指令与非标语音指令的标注的数据对,使其覆盖各种场景且符合用户语音习惯。最后将所生成的标注数据用于NLU模块的增量学习。
可选的,所述生成网络可基于BERT、GPT-3等预训练模型构建,本申请实施例对此不作具体限定。
可以理解,也可以采用少样本学习技术训练语音助手,具体实现原理参照上述,对此不再赘述。
在本申请实施例中,采用少样本学习、增量学习等端侧学习技术,可使得语音助手进一步拓展支持用户未“解释”过或无记录历史的非标准句式/关键词、和/或歧义指令。即不仅可以支持用户“解释”过的非标准句式/关键词、和/或歧义指令,还可以通过生成训练数据集或少样本学习,支持用户未使用或“解释”过的非标准句式/关键词、和/或歧义指令,使得语音助手由机械地拓展支持用户自定义的非标准语音指令,变为挖掘学习用户语音指令中隐含的语言习惯。
示例性地,用户未使用非标准语音指令“打开PYQ”与手机交互过,基于增量学习和生成式对抗网络(Generative Adversarial Networks,GAN)训练更新NLU模块23后,语音助手根据用户的非标准语音指令“飚个歌”挖掘出用户的语言习惯,语音助手可以识别非标准语音指令“打开PYQ”对应的用户意图为“打开朋友圈”。
在本申请实施例中,除对本地语音助手的功能模块进行训练更新外,还可采用联邦学习、数据众包等技术挖掘并学习群体用户的非标准语音指令信息,从而使得语音助手能够更快地适配热词、流行事件等。
在一种可能实现方式中,用户主动或被动上传自定义的非标准语音指令及其对应等效标准语音指令信息,和/或其特征信息,和/或其关联的算法模型训练更新信息至云服务器。云服务器在获取上述信息中的一种或多种后,可通过提取其中的共性信息,训练更新语音助手的公共ASR模块22和NLU模块23,即所有用户默认搭载的ASR算法和NLU算法。云服务器下发训练更新后的ASR和/或NLU算法至用户端,可以通过更新版本方式,将训练更新后的语音助手APP下发至群体用户。
在本申请实施例中,不仅可以学习适配个体用户的非标准语音指令,还可以挖掘适配群体用户的非标准语音指令,从而进一步提升语音助手的运营适配效率。
本申请实施例提供一种计算机可读存储介质,计算机可读存储介质包含用于执行上述任一项的方法的计算机可执行指令。
本申请实施例提供一种系统,系统包括:第二方面提供的计算机可读存储介质;和能够 执行计算机可执行指令的处理器。
本申请实施例提供一种电子设备,包括:至少一个存储器,用于存储程序;和至少一个处理器,用于执行存储器存储的程序,当程序被处理器执行时,以使得电子设备执行如上任一的方法。
虽然已经示出并描述了本发明构思的一些示例实施例,但是本领域普通技术人员之一将理解,在不脱离由所附权利要求限定的精神和范围的情况下,可对其作出各种形式和细节上的修改。因此,以上公开的主题内容应该理解为示出性而非限制性的,并且所附权利要求旨在覆盖落入本发明构思的实质精神和范围内的所有这种修改、改进和其它实施例。因此,在法律允许的最大程度内,通过对所附权利要求及其等同物的允许的最宽解释确定本发明构思的范围,并且所述范围不应由以上具体实施方式限制或局限。
上述各个附图对应的流程的描述各有侧重,某个流程中没有详述的部分,可以参见其他流程的相关描述。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。实现车牌号码识别的计算机程序产品包括一个或多个进行车牌号码识别的计算机指令,在计算机上加载和执行这些计算机程序指令时,全部或部分地产生按照本申请实施例图3的流程或功能。
所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如:同轴电缆、光纤、数据用户线(digital subscriber line,DSL))或无线(例如:红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如:软盘、硬盘、磁带)、光介质(例如:数字通用光盘(digital versatile disc,DVD))、或者半导体介质(例如:固态硬盘(solid state disk,SSD))等。
本领域普通技术人员可以理解实现上述实施例的全部或部分步骤可以通过硬件来完成,也可以通过程序来指令相关的硬件完成,所述的程序可以存储于一种计算机可读存储介质中,上述提到的存储介质可以是只读存储器,磁盘或光盘等。
以上所述为本申请提供的实施例,并不用以限制本申请,凡在本申请的精神和原则之内,所作的任何修改、等同替换、改进等,均应包含在本申请的保护范围之内。

Claims (19)

  1. 一种语音交互方法,其特征在于,所述方法包括:
    接收第一语音指令,所述第一语音指令无法被有效识别;
    接收第二语音指令,并建立所述第二语音指令与所述第一语音指令的关联关系,所述第二语音指令对应第一响应;
    接收第三语音指令,其中,所述第三语音指令与所述第一语音指令的内容或发音相同;
    响应于所述第三语音指令,执行与所述第二语音指令相同的所述第一响应。
  2. 根据权利要求1所述的语音交互方法,其特征在于,所述接收第二语音指令包括:
    在无法生成所述第一语音指令的识别结果时,建立与所述第一语音指令相关联的学习会话;
    在所述学习会话期间,接收第二语音指令。
  3. 根据权利要求2所述的语音交互方法,其特征在于,所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    检测到在所述学习会话期间接收到所述第二语音指令,建立所述第一语音指令与所述第二语音指令的关联关系。
  4. 根据权利要求2或3所述的语音交互方法,其特征在于,所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    检测触发指令;
    在检测到所述触发指令时,将在所述学习会话期间接收到的所述第二语音指令与第一语音指令建立关联关系。
  5. 根据权利要求2至4任一项所述的语音交互方法,其特征在于,所述在所述学习会话期间,接收第二语音指令包括:
    在所述学习会话期间,接收语音输入,其中所述语音输入包括第二语音指令和用于解释所述第一语音指令的第一语音内容;
    则所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    在检测到所述第一语音内容时,建立所述第二语音指令与所述第一语音指令的关联关系。
  6. 根据权利要求5所述的语音交互方法,其特征在于,所述第一语音内容为预设模板。
  7. 根据权利要求2至6任一项所述的语音交互方法,其特征在于,在所述接收第二语音指令之前,还包括:
    输出反馈以引导用户继续输入语音指令。
  8. 根据权利要求1所述的语音交互方法,其特征在于,在所述接收第二语音指令之前还包括:
    响应于所述第一语音指令,执行第二响应,其中所述第二响应不同于所述第一响应。
  9. 根据权利要求7所述的语音交互方法,其特征在于,所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    检测触发指令;
    在检测到所述触发指令时,建立所述第二语音指令与第一语音指令的关联关系。
  10. 根据权利要求7所述的语音交互方法,其特征在于,所述接收第二语音指令包括:
    接收用户的语音输入,其中所述语音输入包括第二语音指令和用于指示所述第一语音指令的识别结果存在错误的第二语音内容。
  11. 根据权利要求9所述的语音交互方法,其特征在于,所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    在检测到所述第二语音内容时,建立所述第二语音指令与所述第一语音指令的关联关系。
  12. 根据权利要求10所述的语音交互方法,其特征在于,所述第二语音内容为预设模板。
  13. 根据权利要求1至11任一项所述的语音交互方法,其特征在于,所述建立所述第一语音指令与所述第二语音指令的关联关系包括:
    将所述第一语音指令等同于所述第二语音指令,或,将所述第二语音指令的第一响应与所述第一语音指令相关联。
  14. 根据权利要求1至12任一项所述的语音交互方法,其特征在于,所述语音交互方法还包括:
    根据所述关联关系生成训练数据集;
    将所述训练数据集用于训练语音助手的模型,以使得所述语音助手能处理适配用户语言习惯的语音指令。
  15. 根据权利要求13所述的语音交互方法,其特征在于,所述根据所述关联关系生成训练数据集包括:
    将所述关联关系上传至云服务器;
    接收群体用户上传的所述关联关系,以生成适配群体用户语言习惯的训练数据集。
  16. 根据权利要求1至15任一项所述的语音交互方法,其特征在于,所述方法还包括:
    接收第四语音指令,其中所述第四语音指令与所述第一语音指令的内容或发音不完全相同,所述第四语音指令与所述第一语音指令的内容或发音的相似度在第一范围内;
    响应于所述第四语音指令,执行与所述第二语音指令相同的所述第一响应。
  17. 一种计算机可读存储介质,所述计算机可读存储介质包含用于执行根据权利要求1至16中任一项所述的方法的计算机可执行指令。
  18. 一种系统,所述系统包括:
    根据权利要求17所述的计算机可读存储介质;和
    能够执行所述计算机可执行指令的处理器。
  19. 一种电子设备,其特征在于,包括:
    至少一个存储器,用于存储程序;和
    至少一个处理器,用于执行所述存储器存储的程序,当所述程序被所述处理器执行时,以使得所述电子设备执行如权利要求1-16任一所述的方法。
PCT/CN2022/115934 2021-09-18 2022-08-30 语音交互方法及电子设备 WO2023040658A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111101013.4A CN115841814A (zh) 2021-09-18 2021-09-18 语音交互方法及电子设备
CN202111101013.4 2021-09-18

Publications (1)

Publication Number Publication Date
WO2023040658A1 true WO2023040658A1 (zh) 2023-03-23

Family

ID=85574274

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115934 WO2023040658A1 (zh) 2021-09-18 2022-08-30 语音交互方法及电子设备

Country Status (2)

Country Link
CN (1) CN115841814A (zh)
WO (1) WO2023040658A1 (zh)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110110502A1 (en) * 2009-11-10 2011-05-12 International Business Machines Corporation Real time automatic caller speech profiling
CN105027197A (zh) * 2013-03-15 2015-11-04 苹果公司 训练至少部分语音命令系统
CN106537491A (zh) * 2014-11-24 2017-03-22 奥迪股份公司 带有操作校正的机动车仪器操作
CN106558307A (zh) * 2015-09-17 2017-04-05 三星电子株式会社 智能对话处理设备、方法和系统
CN106123066A (zh) * 2016-08-06 2016-11-16 广东万家乐燃气具有限公司 一种带自学习功能的语音控制系统及吸油烟机
CN110663079A (zh) * 2017-05-24 2020-01-07 乐威指南公司 基于语音纠正使用自动语音识别生成的输入的方法和系统
CN108877792A (zh) * 2018-05-30 2018-11-23 北京百度网讯科技有限公司 用于处理语音对话的方法、装置、电子设备以及计算机可读存储介质
CN113096653A (zh) * 2021-03-08 2021-07-09 谭维敏 一种基于人工智能的个性化口音语音识别方法及系统

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117234341A (zh) * 2023-11-15 2023-12-15 中影年年(北京)文化传媒有限公司 基于人工智能的虚拟现实人机交互方法及系统
CN117234341B (zh) * 2023-11-15 2024-03-05 中影年年(北京)科技有限公司 基于人工智能的虚拟现实人机交互方法及系统

Also Published As

Publication number Publication date
CN115841814A (zh) 2023-03-24

Similar Documents

Publication Publication Date Title
CN110634483B (zh) 人机交互方法、装置、电子设备及存储介质
CN109074806B (zh) 控制分布式音频输出以实现语音输出
WO2021041517A1 (en) Customizable keyword spotting system with keyword adaptation
WO2021008538A1 (zh) 语音交互方法及相关装置
CN109643548A (zh) 用于将内容路由到相关联输出设备的系统和方法
CN112840396A (zh) 用于处理用户话语的电子装置及其控制方法
WO2020073248A1 (zh) 一种人机交互的方法及电子设备
WO2019242414A1 (zh) 语音处理方法、装置、存储介质及电子设备
CN108648754B (zh) 语音控制方法及装置
US20240005918A1 (en) System For Recognizing and Responding to Environmental Noises
CN110706707B (zh) 用于语音交互的方法、装置、设备和计算机可读存储介质
WO2014173325A1 (zh) 喉音识别方法及装置
WO2023040658A1 (zh) 语音交互方法及电子设备
US10923123B2 (en) Two-person automatic speech recognition training to interpret unknown voice inputs
WO2022147692A1 (zh) 一种语音指令识别方法、电子设备以及非瞬态计算机可读存储介质
CN109670025A (zh) 对话管理方法及装置
WO2023202442A1 (zh) 唤醒设备的方法、电子设备和存储介质
EP4293664A1 (en) Voiceprint recognition method, graphical interface, and electronic device
CN109102812B (zh) 一种声纹识别方法、系统及电子设备
WO2019150708A1 (ja) 情報処理装置、情報処理システム、および情報処理方法、並びにプログラム
CN111739528A (zh) 一种交互方法、装置和耳机
KR20210044606A (ko) 웨이크업 모델 생성 방법 및 이를 위한 전자 장치
CN117953872A (zh) 语音唤醒模型更新方法、存储介质、程序产品及设备
WO2023231936A1 (zh) 一种语音交互方法及终端
US11694684B1 (en) Generation of computing functionality using devices

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22869030

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE