WO2023246894A1 - Voice interaction method and related devices - Google Patents

Voice interaction method and related devices

Info

Publication number
WO2023246894A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
electronic device
intention
list
user
Application number
PCT/CN2023/101818
Other languages
English (en)
French (fr)
Inventor
耿杰
柴海水
赵伟
金洪宾
孙思聪
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023246894A1

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/08 - Speech classification or search
    • G10L 15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/26 - Speech to text systems

Definitions

  • the present application relates to the field of terminal technology, and in particular to voice interaction methods and related devices.
  • a user can give the device a voice command to "play music.”
  • the device can play music after recognizing the voice command.
  • every time the user gives a voice command to the device, the user first needs to wake up the voice interaction application in the device with the wake-up word and then speak the voice command. This makes the voice interaction process between the user and the device unsmooth.
  • the user needs to frequently speak wake words to achieve the purpose of voice controlling the device, resulting in a poor user experience.
  • This application provides voice interaction methods and related devices.
  • the above method can provide users with a full-time wake-up-free voice interaction experience while saving the power consumption of the electronic device. The user can issue voice commands at any time to instruct the electronic device to perform corresponding operations without waking up the voice assistant.
  • this application provides a voice interaction method.
  • This method is applied to electronic devices.
  • Electronic devices include voice assistants.
  • the electronic device can receive the first voice when the voice assistant is in a sleep state.
  • the electronic device may determine that the first voice matches the first intention in the first list, and the first list contains one or more intentions corresponding to the voice instructions.
  • the electronic device can perform the operation corresponding to the first intention.
  • the electronic device can wake up the voice assistant.
  • when the voice assistant is in the awake state, the electronic device can receive the second voice.
  • the electronic device can recognize the second intention in the second voice and perform an operation corresponding to the second intention.
  • the above-mentioned first list may be an execution intention list in this application.
  • the intentions contained in the first list may be called execution intentions.
  • the first list may include intentions corresponding to the user's frequently used voice commands.
  • the above-mentioned common voice commands may include voice commands with high frequency of use, low misrecognition rate, and no ambiguity.
  • the above-mentioned misrecognition rate may refer to the probability of misrecognizing the user's voice that does not contain a voice command as a voice command. This allows users to directly issue common voice commands to control the electronic device to perform corresponding operations without performing a wake-up operation.
  • neither the first voice nor the second voice contains a wake-up word for waking up the voice assistant.
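For orientation, the following is a minimal sketch of this first-aspect flow, assuming invented command strings and action names (none of them come from the patent): speech is matched against the first list while the assistant sleeps, a match triggers the operation and wakes the assistant, and later speech is handled by full recognition.

```python
# Hedged sketch of the first-aspect flow; all names and commands are illustrative.

EXECUTION_INTENTS = {            # the "first list" of execution intentions
    "play music": "media.play",
    "turn on the air conditioner": "ac.on",
}

class Assistant:
    def __init__(self):
        self.awake = False       # sleep state: no wake word has been heard

    def on_speech(self, text):
        if not self.awake:
            intent = EXECUTION_INTENTS.get(text)   # first voice vs. first list
            if intent is not None:
                self.execute(intent)               # operation of the first intention
                self.awake = True                  # wake the voice assistant
        else:
            intent = self.full_recognize(text)     # second voice, larger model
            if intent is not None:
                self.execute(intent)

    def execute(self, intent):
        print("executing", intent)

    def full_recognize(self, text):
        # stand-in for the second (high-computing-power) recognition model
        return EXECUTION_INTENTS.get(text)

a = Assistant()
a.on_speech("play music")                    # handled with no wake word
a.on_speech("turn on the air conditioner")   # now handled while awake
```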
  • the electronic device may include a first speech recognition model and a second speech recognition model.
  • the size of the second speech recognition model is larger than the size of the first speech recognition model.
  • the size of the first speech recognition model and the size of the second speech recognition model may refer to the size of the storage space required by the speech recognition model.
  • the larger the size of the speech recognition model the higher the computing power of the speech recognition model.
  • computing power can represent the ability of a speech recognition model to process and compute data. That is, the computing power of the second speech recognition model is higher than the computing power of the first speech recognition model.
  • the lower the computing power of the speech recognition model the lower the power consumption of the speech recognition model and the fewer computing resources required.
  • the power consumption of the second speech recognition model is higher than the power consumption of the first speech recognition model.
  • the second speech recognition model requires more computing resources than the first speech recognition model. In general, the lower the computing power of a speech recognition model, the fewer parameters it may use. That is, the number of parameters used by the second speech recognition model is greater than the number of parameters used by the first speech recognition model.
  • the electronic device can run the first speech recognition model in real time. Wherein, the electronic device may use the first speech recognition model to determine that the first speech matches the first intention in the first list.
  • the electronic device can run the second speech recognition model while the voice assistant is in the awakened state. Wherein, the electronic device can use the second speech recognition model to recognize the second intention in the second speech. When the electronic device uses the second speech recognition model to recognize the intention in the received speech, it is not necessary to use the above-mentioned first list.
  • the electronic device may also switch the voice assistant from the awake state to the sleep state if no voice is received within a first time period.
  • the above-mentioned first time period may be a period of a preset duration (such as 5 seconds or 10 seconds) starting from the last time the electronic device received a voice while the voice assistant was in the awake state.
  • alternatively, the first time period may start from the time the electronic device last recognized a voice command in the received voice while the voice assistant was in the awake state, and last for the preset duration.
  • alternatively, the first time period may start from the time the electronic device last performed a corresponding operation in response to a received voice command while the voice assistant was in the awake state, and last for the preset duration.
  • for example, the electronic device may detect no voice in the environment within a period of time after receiving the second voice. In this case, the first time period starts when the second voice is received and lasts for the preset duration.
  • as another example, the electronic device may detect no voice in the environment after performing the operation corresponding to the second intention in the second voice. In this case, the first time period starts when that operation completes and lasts for the preset duration.
  • the above embodiments can prevent the electronic device from running a high-computing-power speech recognition model for a long time and consuming too much power when the user issues no voice command after the voice assistant is awakened, thereby saving the power consumption of the electronic device.
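All three variants of the first time period reduce to resetting a single deadline at a different start event. A minimal sketch, assuming a monotonic clock and an invented preset duration:

```python
import time

PRESET_DURATION = 10.0   # e.g. 5 s or 10 s, per the description above

class SleepTimer:
    """Tracks the 'first time period'; reset() is called at whichever start
    event an embodiment uses: last voice received, last command recognized,
    or last operation completed."""

    def __init__(self):
        self.deadline = None

    def reset(self):
        self.deadline = time.monotonic() + PRESET_DURATION

    def expired(self):
        # when True, switch the assistant from the awake state to sleep
        return self.deadline is not None and time.monotonic() > self.deadline

timer = SleepTimer()
timer.reset()            # e.g. the second voice was just received
print(timer.expired())   # False until PRESET_DURATION elapses
```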
  • the first list corresponds to a first sentence list and a first entity list
  • the first sentence list includes one or more sentence patterns
  • the first entity list includes one or more entities
  • One or more intentions in the first list are composed of sentence patterns in the first sentence list and entities in the first entity list.
  • the electronic device can receive the third voice when the voice assistant is in the sleep state.
  • the electronic device may determine that the sentence pattern of the third voice matches the first sentence pattern in the first sentence list, and there is no entity matching the first entity of the third voice in the first entity list.
  • the electronic device can wake up the voice assistant. When the voice assistant is in the awake state, the electronic device can recognize the third intention in the third voice and perform an operation corresponding to the third intention.
  • the third intention consists of the first sentence pattern and the first entity.
  • the intentions in the first list can be divided into entity intentions and non-entity intentions according to whether the intention contains an entity.
  • An entity can refer to a specific instance of a category of things.
  • the object category corresponding to the entity may include one or more of the following: song title, singer name, place name, movie title, TV series title, book title, train number, flight number, phone number, email address, etc.
  • the thing categories corresponding to the above entities can also be called entity categories.
  • an entity intention is an intention that contains an entity.
  • an entity intention can be composed of a sentence pattern and an entity.
  • a sentence pattern can contain a main sentence structure and an entity placeholder. The entity placeholder determines where the entity is placed in the sentence. The sentence pattern of an entity intention supports placing any entity of the same thing category at the location of the entity placeholder.
  • a non-entity intention is an intention that does not contain an entity.
  • the electronic device can use the first speech recognition model to determine that the sentence pattern of the third voice matches the first sentence pattern in the first sentence list and that no entity in the first entity list matches the entity of the third voice. Then, when the voice assistant is in the awake state, the electronic device can use the second speech recognition model to recognize the third intention in the third voice.
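A minimal sketch of the sentence-pattern and entity split described above, with invented patterns and entity names: a full match executes directly, while a sentence-pattern match with an unknown entity wakes the assistant so the larger model can recognize the third intention.

```python
import re

# Illustrative "first sentence list" and "first entity list".
SENTENCE_PATTERNS = {
    "singer_name": re.compile(r"play (.+)'s song"),  # "play [singer name]'s song"
}
ENTITY_LIST = {"singer_name": {"singer 1", "singer 2"}}

def match_while_asleep(text):
    for category, pattern in SENTENCE_PATTERNS.items():
        m = pattern.fullmatch(text)
        if not m:
            continue
        entity = m.group(1)
        if entity in ENTITY_LIST[category]:
            return ("execute", entity)            # intention is in the first list
        return ("wake_and_recognize", entity)     # third-voice case: unknown entity
    return ("stay_asleep", None)

print(match_while_asleep("play singer 2's song"))  # ('execute', 'singer 2')
print(match_while_asleep("play singer 9's song"))  # ('wake_and_recognize', 'singer 9')
```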
  • in this way, even if the entity in the user's voice is not in the first entity list, the electronic device can still respond to the voice and perform the operation corresponding to the voice command issued by the user.
  • the above method can better provide users with a full-time wake-up-free voice interaction experience.
  • in some embodiments, when it is determined that the sentence pattern of the third voice matches the first sentence pattern in the first sentence list and no entity in the first entity list matches the first entity of the third voice, the electronic device can prompt the user to repeat the above third voice (for example, the electronic device can voice broadcast "I didn't hear that clearly, please say it again") and wake up the voice assistant.
  • the user can repeat the above third voice according to the prompts of the electronic device.
  • when the voice assistant is in the awake state, the electronic device can receive the user's voice repeating the third voice, use the second speech recognition model to recognize that voice, and identify the third intention in it. Then, the electronic device can perform the operation corresponding to the third intention.
  • in other embodiments, when it is determined that the sentence pattern of the third voice matches the first sentence pattern in the first sentence list and no entity in the first entity list matches the first entity of the third voice, the electronic device can also add the first entity of the third voice to the first entity list. In this way, when the user speaks the same voice as the third voice again while the voice assistant is in the sleep state, the electronic device can use the first speech recognition model to determine that the voice matches an intention in the first list, and thereby directly perform the corresponding operation.
  • the electronic device can also adjust the first entity list through self-learning during voice interaction, so that the first entity list contains more entities commonly used by the user. This makes the intentions contained in the first list correspond more closely to the user's common voice commands, improving the user's experience of voice interaction with the electronic device.
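Continuing the sketch above, the self-learning step amounts to inserting the new entity into the first entity list, after which the same utterance matches while the assistant sleeps:

```python
def learn_entity(category, entity):
    # add the "first entity" of the third voice to the first entity list
    ENTITY_LIST.setdefault(category, set()).add(entity)

learn_entity("singer_name", "singer 9")
print(match_while_asleep("play singer 9's song"))  # now ('execute', 'singer 9')
```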
  • after adding the first entity of the third voice to the first entity list, the electronic device may receive a fourth voice while the voice assistant is in the sleep state.
  • the electronic device may determine that the sentence pattern of the fourth voice matches the first sentence pattern in the first sentence list and that the entity of the fourth voice matches the first entity in the first entity list, wherein the fourth voice matches the third intention.
  • the electronic device can perform operations corresponding to the third intention.
  • the electronic device can wake up the voice assistant.
  • the electronic device can add the first entity to the first entity list.
  • the above-mentioned first entity and the first sentence pattern may constitute the above-mentioned third intention.
  • when the electronic device adds the first entity to the first entity list, this may be equivalent to adding the above-mentioned third intention to the first list.
  • the user can directly issue voice commands corresponding to the third intention to the electronic device without first waking up the voice assistant.
  • the electronic device can also wake up the voice assistant in addition to performing the operation corresponding to the third intention.
  • the user can further issue more voice commands to the electronic device, thereby conducting multiple rounds of voice interaction with the electronic device without performing a wake-up operation.
  • the electronic device can receive the fifth voice when the voice assistant is in the sleep state.
  • the electronic device may determine that the fifth voice matches a fourth intention in a second list, where an intention in the second list is associated with one or more intentions in the first list, and the fourth intention is associated with a fifth intention in the first list.
  • the electronic device may provide a first prompt for prompting the user to speak a voice that matches the fifth intention.
  • the above-mentioned second list may be the extended intention list in the embodiment of the present application.
  • the intentions in the second list may be called extended intentions.
  • the second list may include intentions corresponding to voices the user speaks when expressing common voice commands indirectly; such voices have a high misrecognition rate.
  • the electronic device may detect, according to the second list, whether the received voice matches an extended intention in the second list. A match between the voice spoken by the user and an extended intention may indicate that the voice is doubtful.
  • the electronic device may provide the user with the above-mentioned first prompt according to the execution intention associated with the extended intention to confirm whether the user wants to implement the execution intention associated with the extended intention.
  • after determining that the user wants to implement the execution intention associated with the extended intention, the electronic device can perform an operation corresponding to that execution intention and conduct voice interaction with the user.
  • the above-mentioned embodiment makes it possible, without waking up the voice assistant, to neither miss voice instructions the user may give nor respond incorrectly to speech that is not a voice instruction, thereby improving the user's voice interaction experience.
  • after providing the first prompt, the electronic device may also receive a sixth voice.
  • the electronic device may determine that the sixth voice matches the fifth intention, and perform an operation corresponding to the fifth intention.
  • the electronic device can wake up the voice assistant.
  • the electronic device can also respond to the user's doubtful voice by prompting the user to speak a more direct and unambiguous voice command (that is, a voice matching the execution intention associated with the extended intention) in order to determine whether the user has issued a voice command.
  • the user speaks a voice that matches the execution intention according to the above first prompt, which can indicate that the user wants to issue a voice command.
  • the electronic device can perform operations corresponding to the voice instructions issued by the user.
  • the above embodiments can reduce missed recognition of voice commands the user may give without waking up the voice assistant, and improve the user's voice interaction experience.
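A hedged sketch of this extended-intent flow, with invented wording and a one-entry second list: a doubtful utterance produces a prompt for the associated execution intention, and only a confirming utterance triggers the operation and wakes the assistant.

```python
# "second list": extended intention -> associated execution intention
EXTENDED_INTENTS = {"i'm so hot": "turn on the air conditioner"}

def on_doubtful_speech(text):
    target = EXTENDED_INTENTS.get(text.lower())
    if target is None:
        return None                                  # not doubtful; ignore
    print(f'first prompt: say "{target}" to confirm')
    return target                                    # the "fifth intention"

def on_followup(text, expected):
    if expected is not None and text.lower() == expected:
        print("executing:", expected)                # operation of the fifth intention
        return True                                  # and wake the voice assistant
    return False

expected = on_doubtful_speech("I'm so hot")           # fifth voice
on_followup("Turn on the air conditioner", expected)  # sixth voice confirms
```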
  • if the electronic device does not receive a voice matching the fifth intention within a second time period after providing the first prompt, the electronic device can cancel the first prompt and keep the voice assistant in the sleep state.
  • the above-mentioned first prompt may be to display text information corresponding to the fifth intention in the user interface of the electronic device.
  • the electronic device can cancel the first prompt by canceling the display of the text information corresponding to the fifth intention on the user interface.
  • the first prompt may be a voice broadcast prompting the user to speak a voice that matches the fifth intention.
  • the electronic device can cancel the first prompt by stopping the voice broadcast that prompts the user to speak a voice matching the fifth intention.
  • the above-mentioned second time period may be a period of a preset duration starting from the time when the electronic device provides the first prompt.
  • the electronic device can keep the voice assistant asleep.
  • the above-mentioned embodiment can reduce incorrect responses to speech that is not a voice instruction without waking up the voice assistant, and the first prompt does not cause much interference to the user, which can improve the user's voice interaction experience.
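The second time period can be handled like the first, with one deadline starting when the prompt is given. A sketch, assuming a hypothetical blocking get_speech helper that returns None when nothing is heard:

```python
import time

PROMPT_TIMEOUT = 8.0   # invented preset duration for the "second time period"

def await_confirmation(expected, get_speech):
    """Wait for a voice matching the fifth intention; cancel the prompt and
    keep the assistant asleep if none arrives before the deadline."""
    deadline = time.monotonic() + PROMPT_TIMEOUT
    while time.monotonic() < deadline:
        remaining = max(0.0, deadline - time.monotonic())
        text = get_speech(remaining)
        if text is not None and text.lower() == expected:
            return True                      # execute and wake the assistant
    print("cancelling the first prompt; assistant stays asleep")
    return False

def silent_listener(timeout):
    time.sleep(timeout)                      # stub: the user says nothing
    return None

await_confirmation("turn on the air conditioner", silent_listener)
```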
  • the first list includes a sixth intention.
  • the electronic device may remove the sixth intention from the first list and add the sixth intention to the second list.
  • when the electronic device detects a voice matching the sixth intention, it can first confirm with the user whether a voice command is being issued. When it is confirmed that the user has issued a voice command, the electronic device may perform the operation corresponding to the sixth intention.
  • the above method can reduce misrecognition caused by treating non-command speech as voice commands in scenarios where voice commands are issued without waking up the voice assistant, improving the user's experience of voice interaction with the electronic device.
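Maintaining the two lists for the sixth intention is then a move from one collection to the other; a sketch with invented intent wording:

```python
def demote_to_extended(intent, execution_list, extended_list):
    """Remove an intention from the first (execution) list and add it to the
    second (extended) list, so it is confirmed before being executed."""
    action = execution_list.pop(intent, None)
    if action is not None:
        extended_list[intent] = intent   # same wording now requires confirmation

execution = {"turn off the screen": "display.off"}
extended = {}
demote_to_extended("turn off the screen", execution, extended)
print(execution, extended)   # {} {'turn off the screen': 'turn off the screen'}
```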
  • this application provides a voice interaction method, which is applied to electronic devices.
  • the electronic device includes a voice assistant. The electronic device receives the first voice when the voice assistant is in a sleep state. In response to the first voice, the electronic device may provide a first prompt, where the first prompt is used to prompt the user to speak a first instruction. The electronic device can then receive a second voice, determine that the second voice matches the first instruction, and perform an operation corresponding to the first instruction.
  • neither the first voice nor the second voice contains a wake-up word for waking up the voice assistant.
  • the above-mentioned first prompt may be to display text information corresponding to the first instruction in the user interface of the electronic device.
  • the first prompt may be a voice broadcast prompting the user to speak a voice that matches the first instruction.
  • the electronic device may include a first speech recognition model and a second speech recognition model.
  • the size of the second speech recognition model is larger than the size of the first speech recognition model.
  • the size of the first speech recognition model and the size of the second speech recognition model may refer to the size of the storage space required by the speech recognition model.
  • the larger the size of the speech recognition model the higher the computing power of the speech recognition model.
  • computing power can represent the ability of a speech recognition model to process and compute data. That is, the computing power of the second speech recognition model is higher than the computing power of the first speech recognition model.
  • the lower the computing power of the speech recognition model the lower the power consumption of the speech recognition model and the fewer computing resources required.
  • the electronic device can run the second speech recognition model while the voice assistant is in the awakened state.
  • the above method of providing the first prompt in response to the first voice may specifically include: in response to the first voice, the electronic device may use the first speech recognition model to determine that the first voice is associated with the first instruction. The electronic device may then provide the first prompt based on the association between the first voice and the first instruction.
  • the electronic device may store the first list.
  • the above-mentioned first list may be the execution intention list in this application.
  • the first list may contain one or more intentions corresponding to the voice instructions.
  • the intentions contained in the first list may be called execution intentions.
  • the first list may include intentions corresponding to the user's frequently used voice commands.
  • the above-mentioned common voice commands may include voice commands with high frequency of use, low misrecognition rate, and no ambiguity.
  • the above-mentioned misrecognition rate may refer to the probability of misrecognizing the user's voice that does not contain a voice command as a voice command. This allows users to directly issue common voice commands to control the electronic device to perform corresponding operations without performing a wake-up operation.
  • the intention corresponding to the above-mentioned first instruction consists of a first sentence pattern and a first entity.
  • the above-mentioned association between the first voice and the first instruction may mean: the sentence pattern of the first voice is the above-mentioned first sentence pattern, and the entity of the first voice is the above-mentioned first entity.
  • the first sentence list contains the first sentence pattern.
  • the above-mentioned first entity is not included in the first entity list.
  • the present application provides a chip, which is applied to an electronic device.
  • the chip includes one or more processors, and the processor is used to call computer instructions to cause the electronic device to perform the method in any possible implementation of the first aspect or the second aspect.
  • Figure 1 is a schematic structural diagram of an electronic device 100 provided by an embodiment of the present application.
  • Figure 3 is a framework diagram of a voice interaction system 30 provided by an embodiment of the present application.
  • FIGS. 5A and 5B are schematic diagrams of other voice interaction scenarios provided by embodiments of the present application.
  • FIGS. 7A and 7B are schematic diagrams of other voice interaction scenarios provided by embodiments of the present application.
  • Figure 9 is a schematic diagram of a method for adjusting an execution intention list provided by an embodiment of the present application.
  • the electronic device can implement a "wake up once, converse continuously" voice interaction solution. Specifically, the electronic device can detect in real time whether the collected sound contains the wake-up word. When the wake-up word is detected, the electronic device can wake up the voice assistant and use the voice assistant to perform intent recognition and action execution on speech collected after the wake-up word. For example, after speaking the wake-up word "Xiaoyi Xiaoyi", the user further speaks the voice command "play music". After the electronic device detects the wake-up word, it can wake up the voice assistant to recognize the voice command. When it recognizes that the intention corresponding to the voice command is to play music, the electronic device can play music.
  • the above-mentioned voice assistant is a voice interaction application.
  • the above-mentioned voice assistant can also be called a voice recognition application and other names.
  • the embodiments of the present application do not limit this.
  • the above-mentioned voice command may refer to the voice used to control the electronic device to perform one or more operations.
  • after waking up the voice assistant, the electronic device can continuously detect human voices in the environment through the voice assistant and perform intent recognition and action execution. When no human voice is detected within a preset period of time, the electronic device can exit the voice assistant's awake state. After the voice assistant exits the awake state, it needs to be awakened again in response to a wake-up operation.
  • the above-mentioned wake-up operation of waking up the voice assistant may include waking up through a wake-up word, waking up through physical buttons or virtual buttons on the electronic device, etc.
  • the embodiments of the present application do not limit the above-mentioned wake-up operation for waking up the voice assistant.
  • the electronic device can recognize the multiple voice commands and perform the operations corresponding to them. While the user continuously issues multiple voice commands, the user does not need to say the wake-up word before each one. After the user stops speaking for a preset time period, if the user wants to control the electronic device through voice commands again, the user needs to speak the wake-up word again to wake up the voice assistant.
  • the user can continuously talk to the electronic device and realize multiple rounds of voice interaction with the electronic device. This can improve the fluency of voice interaction between users and electronic devices.
  • the user still needs to wake up the voice assistant first and then conduct voice interaction with the electronic device. Users still cannot control electronic devices by voice at any time without speaking a wake word or performing other wake-up operations. Users have a poor experience using the voice interaction function.
  • one or more fixed command words may be stored in the electronic device, such as pause playback, continue playback, previous song, next song, previous episode, next episode, etc.
  • when the electronic device receives a voice matching a fixed command word, it can perform the operation corresponding to that fixed command word. For example, suppose an electronic device is playing music and the user says "pause playback".
  • the electronic device can determine that the voice matches the fixed command word "pause playback". Then, the electronic device can pause the music that is currently playing. In this way, the user can issue such voice commands to control the electronic device without performing a wake-up operation.
  • the fixed command words stored by the electronic device are usually limited.
  • the above fixed command words can usually be used in specified scenes, such as video playback scenes and music playback scenes.
  • for voice commands outside the range of the fixed command words, the electronic device is unable to respond when the voice assistant is not awakened. In other words, when users issue voice commands outside the range of the fixed command words, they still need to wake up the voice assistant first.
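The fixed-command-word approach amounts to a closed-set lookup, which is why out-of-set commands still require the wake word; a sketch with an invented command set:

```python
FIXED_COMMANDS = {"pause playback", "continue playback",
                  "previous song", "next song"}

def handle_without_wake(text):
    if text in FIXED_COMMANDS:
        print("executing fixed command:", text)
        return True
    return False            # outside the fixed set: wake word still required

handle_without_wake("pause playback")      # handled with no wake word
handle_without_wake("navigate to work")    # not handled; assistant must be woken
```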
  • the execution intention list may be stored in the electronic device.
  • the execution intention list may include intentions corresponding to the user's frequently used voice commands.
  • the electronic device can run a low-computing speech recognition model when the voice assistant is not awakened to detect whether the received speech matches the intent in the execution intent list.
  • the electronic device can perform an operation corresponding to the voice-matched intention.
  • the electronic device can also wake up the voice assistant and run the high-computing-power speech recognition model to respond to subsequent voice commands issued by the user. When it is detected that the received voice does not match any intention in the execution intention list, the electronic device can continue to run the low-computing-power speech recognition model without waking up the voice assistant.
  • the electronic device can run a low-computing power speech recognition model in real time to detect whether the user speaks common voice commands.
  • the electronic device can directly perform the operation corresponding to the common voice command. That is to say, users can directly issue some common voice commands to electronic devices without waking up the voice assistant first. And after issuing common voice commands, the user can further issue more voice commands to the electronic device, and conduct multiple rounds of voice interaction with the electronic device without performing a wake-up operation.
  • the electronic device may detect whether the received voice matches the extension intention.
  • the electronic device may prompt the user to speak the execution intention associated with the above-matched extended intention to confirm whether the user issues a voice command. Further, when receiving voice that matches the above execution intention, the electronic device can perform operations corresponding to the execution intention and wake up the voice assistant.
  • the electronic device can also use the extended intention list to analyze the user's potentially doubtful voice and determine whether the user wants to issue a voice command. After confirming that the user wants to issue a voice command, the electronic device can perform the operation corresponding to that voice command.
  • the above embodiments can improve the recognition rate of recognizing user voice commands without waking up the voice assistant, thereby improving the user experience of controlling electronic devices through voice in a full-time wake-up-free scenario.
  • the above-mentioned full-time wake-up-free function means that the user can issue voice commands at any time without first performing a wake-up operation to wake up the voice assistant.
  • the above-mentioned low computing power speech recognition model has low computing power level and low power consumption.
  • when an electronic device runs a low-computing-power speech recognition model in real time without waking up the voice assistant, it usually does not produce excessive power consumption, and thus does not cause problems such as device heating or operation lag.
  • the voice interaction method provided by this application can save the power consumption of electronic devices on the basis of realizing full-time wake-up-free operation.
  • both the low-computing power speech recognition model and the high-computing power speech recognition model are models based on neural networks.
  • a neural network may include an input layer, a hidden layer, and an output layer, with each layer having one or more nodes.
  • low-computing power speech recognition models have fewer hidden layers and/or fewer nodes in the hidden layer.
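As a rough illustration with invented layer sizes: in a fully connected network, the weight count grows with the product of adjacent layer widths, so fewer and narrower hidden layers mean far fewer parameters to compute.

```python
def weight_count(layer_sizes):
    # weights of a fully connected network, biases ignored for simplicity
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

small = weight_count([40, 64, 32, 8])        # low-computing-power model
large = weight_count([40, 512, 512, 256, 8]) # high-computing-power model
print(small, large)                          # the large model has far more weights
assert large > small
```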
  • the above-mentioned low computing power speech recognition model can be deployed on the terminal side, that is, on the electronic device.
  • the above-mentioned high-computing speech recognition model can be deployed on the device side or on the cloud side, that is, on the cloud server.
  • all processes of voice interaction can be completed on electronic devices.
  • after the electronic device wakes up the voice assistant, the electronic device can use the local high-computing-power speech recognition model to perform speech recognition for voice interaction.
  • voice interaction can be completed through a solution that combines devices and clouds. Without waking up the voice assistant, the electronic device can use the local low-computing power speech recognition model for speech recognition to perform voice interaction.
  • the electronic device can communicate with the cloud server and use the high computing power speech recognition model on the cloud server to perform speech recognition for voice interaction.
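The deployment choice reduces to a routing decision between three recognizers; a sketch with stub functions (the cloud call is a placeholder, not a real API):

```python
def local_small_model(audio):  return "small:" + audio   # always on-device
def local_large_model(audio):  return "large:" + audio   # device-side option
def cloud_recognize(audio):    return "cloud:" + audio   # placeholder server call

def recognize(audio, awake, cloud_available):
    if not awake:
        return local_small_model(audio)   # asleep: low-power model only
    if cloud_available:
        return cloud_recognize(audio)     # awake, device-cloud solution
    return local_large_model(audio)       # awake, all-on-device solution

print(recognize("turn on the air conditioner", awake=True, cloud_available=True))
```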
  • the execution intention list may include intentions corresponding to the user's frequently used voice commands. Among them, intent can represent what the user wants to do. According to a piece of voice spoken by the user, the intention corresponding to this piece of voice can be identified to identify what the user wants to do when speaking this piece of voice. For example, when the user says "turn on the air conditioner", the user's intention is to hope that the electronic device can turn on the air conditioner. After the electronic device recognizes the user's intention to say “turn on the air conditioner", it can turn on the air conditioner.
  • the electronic device can quickly respond to common voice commands spoken by the user according to the execution intention list when the voice assistant is not awakened.
  • the above-mentioned common voice commands may include voice commands with high frequency of use, low misrecognition rate, and no ambiguity.
  • the above-mentioned misrecognition rate may refer to the probability of misrecognizing the user's voice that does not contain a voice command as a voice command.
  • the user may often instruct the car computer to open/close the windows, turn on/off the air conditioner, play music, adjust the volume, navigate, etc. through voice commands.
  • common voice commands can include opening the car window, closing the car window, turning on the air conditioner, turning off the air conditioner, playing song 1, playing the song of singer 1, turning up the system volume, navigating to location 1, etc.
  • the above Table 1 is only an exemplary description of the execution intention list in the embodiment of the present application, and should not limit the execution intention list.
  • the list of execution intentions can also contain more or fewer intentions.
  • the intentions in the above execution intention list can also be classified according to application scenarios. For example, intentions can be classified according to application scenarios into car control, settings, music, navigation, etc. "Open the car window”, “Close the car window”, “Turn on the air conditioner”, “Turn off the air conditioner” in Table 1 above may belong to the car control type of intentions. "System volume up” can belong to the setting class of intents. "Play song 1" and “play singer 1's song” may belong to music-type intentions. "Navigate to location 1" may belong to the navigation type of intent.
  • when performing intent recognition, the electronic device can use a speech recognition model (such as the low-computing-power speech recognition model or the high-computing-power speech recognition model) to first identify which category of intent the received speech corresponds to, and then determine the meaning of the speech based on the keywords in the speech.
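The two-stage recognition just described (category first, then keywords) can be pictured with a category table; the entries below are the Table 1 examples from this description, and the category keys are invented:

```python
INTENT_CATEGORIES = {
    "car_control": ["open the car window", "close the car window",
                    "turn on the air conditioner", "turn off the air conditioner"],
    "settings":    ["turn up the system volume"],
    "music":       ["play song 1", "play singer 1's song"],
    "navigation":  ["navigate to location 1"],
}

def classify(text):
    # stage 1: find the intent category; stage 2 would inspect keywords
    for category, intents in INTENT_CATEGORIES.items():
        if text in intents:
            return category
    return None

print(classify("play song 1"))   # 'music'
```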
  • the intentions in the above-mentioned execution intention list can be divided into entity intentions and non-entity intentions according to whether the intention contains an entity.
  • An entity can refer to a specific instance of a category of things.
  • the object category corresponding to the entity may include one or more of the following: song title, singer name, place name, movie title, TV series title, book title, train number, flight number, phone number, email address, etc.
  • specific examples of song titles may include song 1, song 2, song 3, and so on.
  • Specific examples of singer names may include singer 1, singer 2, singer 3, and so on.
  • the specific instances under a thing category can span a large range, from a few to millions of entities.
  • the embodiments of this application do not limit the types of things corresponding to the above entities.
  • the thing categories corresponding to the above entities can also be called entity categories.
  • an entity intention is an intention that contains an entity.
  • an entity intention can be composed of a sentence pattern and an entity.
  • a sentence pattern can contain a main sentence structure and an entity placeholder. The entity placeholder determines where the entity is placed in the sentence.
  • the sentence pattern of an entity intention supports placing any entity of the same thing category at the location of the entity placeholder.
  • for example, the sentence pattern of an entity intention can be "play [singer name]'s song". Here, "play ... song" is the main sentence structure, and "[singer name]" is the entity placeholder, located between "play" and "song" in the main structure.
  • this sentence pattern supports placing any entity under the thing category "singer name" at the location of the entity placeholder. For example, if "singer 2" is placed at the placeholder, the entity intention is "play singer 2's song"; if "singer 3" is placed there, the entity intention is "play singer 3's song".
  • the above execution intention list can correspond to a sentence list and an entity list.
  • the sentence pattern list may include sentence patterns with entity intentions in the execution intention list.
  • the entity list may include the entities of the entity intentions in the execution intention list.
  • the entities in the entity list can be classified according to the categories of things corresponding to the entities. For example, entities like song titles, entities like singer names, entities like place names, etc.
  • a non-entity intention is an intention that does not contain an entity.
  • all intentions in the execution intention list other than entity intentions are non-entity intentions. It can be seen that "open the car window", "close the car window", "turn on the air conditioner", "turn off the air conditioner" and "turn up the system volume" in Table 1 above are non-entity intentions.
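Putting the pieces together, the execution intention list can be stored as a sentence pattern list, an entity list grouped by entity category, and a set of non-entity intentions; a data-shape sketch using the examples above:

```python
# Illustrative shapes only; entries mirror the examples in this description.
SENTENCE_LIST = [
    {"pattern": "play [song title]",         "category": "song_title"},
    {"pattern": "play [singer name]'s song", "category": "singer_name"},
]
ENTITY_LIST = {
    "song_title":  ["song 1", "song 2", "song 3"],
    "singer_name": ["singer 1", "singer 2", "singer 3"],
}
NON_ENTITY_INTENTS = [
    "open the car window", "close the car window",
    "turn on the air conditioner", "turn off the air conditioner",
    "turn up the system volume",
]
```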
  • the above execution intention list may be preset.
  • when a voice assistant is installed on an electronic device, in addition to obtaining and storing the low-computing-power speech recognition model, the electronic device can also obtain and store a preset execution intention list.
  • the electronic device can also adjust the execution intention list through self-learning during voice interaction, so that the intentions contained in the execution intention list are closer to the user's common voice instructions, thereby improving the user's experience of voice interaction with the electronic device.
  • the implementation process of the electronic device self-learning to adjust the execution intention list will be introduced in subsequent embodiments and will not be described here.
  • the electronic device can also receive a user operation for adjusting the execution intention list and adjust the list accordingly.
  • the above execution intention list may also be called the first list.
  • the embodiment of the present application does not limit the name of the execution intention list.
  • the extended intention list may include intentions corresponding to voices that users speak when expressing common voice commands indirectly; such voices have a high misrecognition rate. Any extended intention in the extended intention list can be associated with one or more execution intentions in the execution intention list.
  • the voices matching the execution intentions in the above execution intention list are all direct and unambiguous. After the electronic device receives the voice matching the above execution intention, it can clearly determine what the user wants to do. In actual voice interaction scenarios, users may also speak questionable voices when giving voice commands to electronic devices.
  • the above-mentioned doubtful voice means that after the electronic device receives the voice, it cannot determine whether the user is giving a voice command or speaking in a scenario other than giving a voice command (such as chatting with other people). In other words, such doubtful voices have a high misrecognition rate.
  • if the electronic device directly treats the doubtful voice as a voice command and performs the corresponding operation, it may happen that the user did not actually give a command, yet the electronic device frequently responds and interacts with the user, resulting in a poor experience. Conversely, if the electronic device directly treats the doubtful voice as not being a voice command and does not respond, it may happen that the user actually issued a command but the electronic device does not respond, which also results in a poor experience.
  • the user speaks the voice "I'm so hot.”
  • One situation is when the user is giving voice commands.
  • the user says “I'm so hot” and expects the electronic device to turn on the air conditioner.
  • Another situation is when the user does not issue a voice command.
  • the user said “I'm so hot” while chatting with other people.
  • the electronic device recognizes the received voice as "I'm so hot” and can further confirm to the user whether the user wants to issue a voice command.
  • the electronic device can detect whether the received voice matches the extended intention in the extended intention list according to the extended intention list. Matching the voice spoken by the user with the extended intent may indicate that there is doubt in the voice spoken by the user.
  • the electronic device can confirm to the user whether the user wants to implement the execution intention associated with the extended intention according to the execution intention associated with the extended intention. After determining that the user wants to implement the execution intention associated with the extended intention, the electronic device can perform an operation corresponding to the execution intention and perform voice interaction with the user.
  • the above Table 4 is only an exemplary description of the extended intention list in this embodiment of the present application, and should not limit the extended intention list.
  • the expanded intent list can also contain more or fewer intents.
  • any extended intent in the extended intent list can be associated with one or more execution intentions in the execution intent list.
  • the execution intention list includes the execution intention "turn on the air conditioner".
  • the extended intent "I'm so hot” in Table 4 can be associated with the execution intent "Turn on the air conditioner".
  • the electronic device can prompt the user to say the execution intention "turn on the air conditioner” associated with "I am so hot”. Then, when the electronic device recognizes that the received voice is "turn on the air conditioner", the electronic device can perform an operation corresponding to the intention of "turn on the air conditioner", that is, turn on the air conditioner.
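Because one extended intention may be associated with several execution intentions, the association is naturally a one-to-many mapping; an illustrative sketch:

```python
# extended intention -> one or more associated execution intentions
EXTENDED_TO_EXECUTION = {
    "i'm so hot": ["turn on the air conditioner"],  # could also list, e.g.,
                                                    # "open the car window"
}

for target in EXTENDED_TO_EXECUTION["i'm so hot"]:
    print(f'prompt the user to say: "{target}"')
```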
  • the above-mentioned extended intention list may be preset.
  • an electronic device installs a voice assistant, it can obtain and store a preset expanded intent list.
  • the electronic device can also adjust the extended intention list through self-learning during voice interaction.
  • for example, an extended intention in the extended intention list may be added to the execution intention list, thereby promoting the extended intention to an execution intention.
  • the electronic device can also receive a user operation for adjusting the extended intention list and adjust the list accordingly.
  • the above extended intent list may also be called a second list.
  • the embodiment of the present application does not limit the name of the extended intent list.
  • a voice assistant can be an application used to implement voice interaction in electronic devices.
  • the voice assistant can be preset in the electronic device when it leaves the factory.
  • the voice assistant can also be installed by the electronic device in response to the user's operation of installing the voice assistant, or when the electronic device system is updated.
  • the embodiments of this application do not limit the implementation method of installing a voice assistant on an electronic device.
  • the electronic device when the electronic device installs the voice assistant, it can obtain and store the low-computing power speech recognition model, execution intention list and extended intention list.
  • the electronic device when the electronic device installs a voice assistant, it can also obtain and store a high-computing speech recognition model.
  • the state of the voice assistant may include a sleep state and a wake state.
  • when not awakened, the voice assistant is in the sleep state.
  • the electronic device can run a low-computing speech recognition model to identify whether the received voice matches the execution intention.
  • the electronic device can perform the operation corresponding to the execution intention and wake up the voice assistant.
  • when the electronic device 100 uses the low-computing-power speech recognition model to recognize the intention in the user's voice, the electronic device 100 can use the above execution intention list and extended intention list.
  • when waking up the voice assistant, the electronic device can switch the voice assistant from the sleep state to the awake state. When the voice assistant is in the awake state, the electronic device can run the high-computing-power speech recognition model to identify the intention corresponding to the received voice, thereby conducting voice interaction with the user. When the electronic device 100 uses the high-computing-power speech recognition model to recognize the intention in the user's voice, the electronic device 100 does not need to use the execution intention list and the extended intention list.
  • the electronic device can switch the voice assistant from the wake state to the sleep state.
  • electronic devices can run low-computing speech recognition models when the voice assistant is in sleep state.
  • the electronic device can run the high-computing-power speech recognition model when the voice assistant is awake. Since the power consumption of the low-computing-power speech recognition model is low, the electronic device can keep running it when the voice assistant is not awakened, consuming as little power as possible while providing the user with a full-time wake-up-free experience.
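The state handling described here is a small two-state machine that decides which model is active; a sketch with placeholder model objects rather than real recognizers:

```python
class VoiceAssistantState:
    SLEEP, AWAKE = "sleep", "awake"   # first state / second state

    def __init__(self, small_model, large_model):
        self.small_model = small_model   # runs whenever the assistant sleeps
        self.large_model = large_model   # runs only while the assistant is awake
        self.state = self.SLEEP

    def active_model(self):
        return self.small_model if self.state == self.SLEEP else self.large_model

    def wake(self):
        self.state = self.AWAKE

    def sleep(self):
        self.state = self.SLEEP

assistant = VoiceAssistantState(small_model="low-power model",
                                large_model="high-power model")
print(assistant.active_model())   # low-power model while asleep
assistant.wake()
print(assistant.active_model())   # high-power model once awakened
```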
  • the above sleep state may also be called the first state.
  • the above wake-up state can also be called the second state.
  • the embodiments of the present application do not limit the names of the above sleep states and wake states.
  • the processor 110 may include one or more processing units.
  • the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charging device. While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, internal memory 121, external memory, display screen 194, camera 193, wireless communication module 160, etc.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in electronic device 100 may be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization. For example: Antenna 1 can be reused as a diversity antenna for a wireless LAN. In other embodiments, antennas may be used in conjunction with tuning switches.
  • the mobile communication module 150 can provide solutions for wireless communication including 2G/3G/4G/5G applied on the electronic device 100 .
  • the mobile communication module 150 may include at least one filter, switch, power amplifier, low noise amplifier (LNA), etc.
  • the mobile communication module 150 can receive electromagnetic waves through the antenna 1, perform filtering, amplification and other processing on the received electromagnetic waves, and transmit them to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor and convert it into electromagnetic waves through the antenna 1 for radiation.
  • the electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • the display screen 194 is used to display images, videos, etc.
  • the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device 100 can implement the shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the optical signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image visible to the naked eye.
  • Camera 193 is used to capture still images or video.
  • the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • NPU is a neural network (NN) computing processor.
  • Intelligent cognitive applications of the electronic device 100 can be implemented through the NPU, such as image recognition, face recognition, speech recognition, text understanding, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function. Such as saving music, videos, etc. files in external memory card.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some examples, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110. Speaker 170A, also called a "loudspeaker", is used to convert audio electrical signals into sound signals. Receiver 170B, also called an "earpiece", is used to convert audio electrical signals into sound signals. Microphone 170C, also called a "mic", is used to convert sound signals into electrical signals. The headphone interface 170D is used to connect wired headphones.
  • the sensor module 180 may include a pressure sensor, a gyroscope sensor, an air pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
  • the buttons 190 include a power button, a volume button, etc.
  • the motor 191 can generate vibration prompts.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of this application takes a layered-architecture Android system as an example to illustrate the software structure of the electronic device 100.
  • the layered architecture divides the software into several layers, and each layer has clear roles and division of labor.
  • the layers communicate through software interfaces.
  • the Android system is divided into four layers, which from top to bottom are: the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
  • the application package can include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, short message, voice assistant, etc.
  • for the voice assistant, please refer to the introduction in the foregoing embodiments.
  • the application framework layer provides APIs and programming frameworks for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include window manager, content provider, view system, phone manager, resource manager, notification manager, activity manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, etc.
  • a view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • the phone manager is used to provide communication functions of the electronic device 100, for example, management of call status (connected, hung up, etc.).
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the notification manager allows applications to display notification information in the status bar (such as the pull-down notification bar), which can be used to convey notification-type messages and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also present notifications that appear in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications for applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a beep sounds, the electronic device vibrates, or the indicator light flashes.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library contains two parts: one part is the function libraries that the Java language needs to call, and the other part is the core library of Android.
  • System libraries can include multiple functional modules. For example: surface manager (surface manager), media libraries (Media Libraries), 3D graphics processing libraries (for example: OpenGL ES), 2D graphics engines (for example: SGL), etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, and layer processing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • the voice interaction system 30 may include an electronic device 100 and a cloud server 200 .
  • a communication connection may be established between the electronic device 100 and the cloud server 200 .
  • the embodiment of the present application does not limit the communication method between the electronic device 100 and the cloud server 200.
  • the cloud server 200 may include a high computing power speech recognition model.
  • all processes of voice interaction can be completed on the electronic device 100 .
  • the electronic device 100 can run a local low-computing power speech recognition model to quickly respond to voice commands issued by the user.
  • the electronic device 100 can use a local high-computing power speech recognition model to perform speech recognition to perform voice interaction.
  • the electronic device 100 can also communicate with the cloud server 200 and use the high computing power speech recognition model on the cloud server 200 to perform speech recognition.
  • the electronic device 100 can adopt whichever speech recognition result is obtained fastest.
  • the electronic device 100 may determine the accuracy of speech recognition using a local high-computing speech recognition model and the accuracy of speech recognition using a high-computing speech recognition model on the cloud server 200 .
  • the electronic device 100 can adopt speech recognition results with higher accuracy.
  • the embodiments of this application do not limit the above-mentioned method of realizing voice interaction through device-cloud integration.
  • the electronic device 100 can not only use the local high-computing power speech recognition model to perform speech recognition, but also use the high-computing power speech recognition model in the cloud server 200, so that users can experience multiple rounds of voice interaction with the electronic device 100 without performing a wake-up operation.
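  • as a rough illustration of the device-cloud combination described above, the "adopt the fastest result" strategy can be sketched as racing the two recognizers. This is a minimal Python sketch under assumed interfaces; the coroutine names and simulated latencies are invented for illustration and are not part of this application:

```python
import asyncio

async def recognize_local(audio: bytes) -> str:
    # Stand-in for the on-device high-computing power speech recognition model.
    await asyncio.sleep(0.05)                  # simulated on-device latency
    return "local result"

async def recognize_cloud(audio: bytes) -> str:
    # Stand-in for the model on the cloud server 200.
    await asyncio.sleep(0.20)                  # simulated network + cloud latency
    return "cloud result"

async def recognize_fastest(audio: bytes) -> str:
    """Run both recognizers and adopt whichever result is obtained fastest."""
    tasks = [asyncio.create_task(recognize_local(audio)),
             asyncio.create_task(recognize_cloud(audio))]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                          # discard the slower recognizer
    return done.pop().result()

print(asyncio.run(recognize_fastest(b"...")))  # prints "local result" here
```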
  • in the following, the electronic device 100 being a vehicle-mounted computer is taken as an example to introduce the voice interaction method in an in-vehicle scenario.
  • the voice interaction method provided by this application can also be applied to other scenarios.
  • Figures 4A to 4C exemplarily illustrate a wake-up-free voice interaction scenario provided by embodiments of the present application.
  • the voice assistant in the electronic device 100 may be in a sleep state.
  • the electronic device 100 may display the user interface 410 shown in FIG. 4A.
  • User interface 410 may be the desktop of electronic device 100 .
  • the user interface 410 may display application icons (such as navigation application icons, radio application icons, music application icons, etc.), time controls and other interface elements.
  • the embodiment of the present application does not limit the content displayed on the user interface 410.
  • the electronic device 100 can run a low-computing power speech recognition model to identify whether the detected voice matches the execution intention in the execution intention list.
  • the user gives the voice command "Play Song 1" to the electronic device 100 in the car.
  • the execution intention list stored by the electronic device 100 includes the execution intention "Play Song 1".
  • the electronic device 100 may detect the voice "Play song 1" in the environment.
  • the electronic device 100 can use a low-computing power speech recognition model to recognize that the speech matches the execution intention "Play Song 1". Then, the electronic device 100 may perform the operation corresponding to the execution intention, that is, call the music application to play song 1.
  • the electronic device 100 can also wake up the voice assistant and switch the voice assistant from sleep state to wake state.
  • the electronic device 100 may announce "Okay, play song 1 for you" and start playing song 1.
  • the electronic device 100 can display the user interface 420 shown in FIG. 4B.
  • the user interface 420 may include a song playing component 411 and a wake-up indicator 412 .
  • the song playing component 411 may be used to indicate the song currently being played by the electronic device 100 .
  • the song name "Song 1" is displayed in the song playing component 411, which may indicate that the electronic device 100 is currently playing Song 1.
  • the song playing component 411 may also include a pause control, a next song control, and a previous song control, so that the user can control the music played by the electronic device 100 through the controls in the song playing component 411 .
  • the song playing component 411 may also include lyrics (not shown in the figure).
  • the embodiment of the present application does not limit the content included in the song playing component 411.
  • the wake-up indicator 412 may be used to indicate that the voice assistant in the electronic device 100 is in the awake state. That is to say, when the voice assistant is in the awake state, the electronic device 100 can display the wake-up indicator 412 on the user interface 420.
  • the electronic device 100 can run a high-computing speech recognition model to identify the intention corresponding to the received voice, thereby performing voice interaction with the user.
  • the electronic device 100 can run one or more high-computing power speech recognition models when the voice assistant is in the awake state.
  • the embodiments of this application do not limit the number of high-computing power speech recognition models.
  • the electronic device 100 can wake up the voice assistant, use a high-computing power speech recognition model to better respond to the user's subsequent requests, and achieve multiple rounds of voice interaction with the user.
  • when the voice assistant is in the awake state, the user issues a voice command "close the car window" to the electronic device 100 in the car.
  • the electronic device 100 may detect the voice "Close window” in the environment.
  • the electronic device 100 can recognize the intention of the voice through a high-computing power speech recognition model, and perform an operation corresponding to the intention, that is, calling a module that controls the car window to close the car window.
  • the electronic device 100 can play the voice "OK, closing the car window for you" and close the car window.
  • the electronic device 100 can display the user interface 430 shown in FIG. 4C.
  • the user interface 430 may include the wake-up indicator 412 and a voice broadcast component 413.
  • the voice broadcast component 413 may display content voice broadcast by the electronic device 100 in response to the user's voice instruction.
  • the electronic device 100 can switch the voice assistant from the wake state to the sleep state. This can prevent the electronic device 100 from running a high-computing power speech recognition model for a long time and consuming too much power when the user does not issue a voice command.
  • the user can directly issue voice commands without performing a wake-up operation to wake up the voice assistant.
  • This can help users quickly control electronic devices with voice in some common scenarios (such as controlling hardware devices in the car, listening to music, navigation, etc.).
  • the user can continuously issue multiple voice commands to the electronic device 100 and perform multiple rounds of voice interaction with the electronic device 100 . In these multiple rounds of voice interaction, the user does not need to perform a wake-up operation to wake up the voice assistant.
  • the above embodiments can improve the fluency of voice interaction between users and electronic devices.
  • FIGS 5A and 5B exemplarily illustrate another wake-up-free voice interaction scenario provided by embodiments of the present application.
  • the voice assistant in the electronic device 100 is currently in a sleep state.
  • the user speaks the voice "I'm so hot”.
  • the extended intention list stored by the electronic device 100 includes the extended intention “I am so hot”.
  • the extended intention "I am so hot” is associated with the execution intention "Turn on the air conditioner" in the execution intention list.
  • the electronic device 100 can detect the voice "I'm so hot” in the environment.
  • the electronic device 100 may use a low-computing power speech recognition model to identify that there is no execution intention matching the voice in the execution intention list. Then, the electronic device 100 may determine whether the voice matches the extended intention in the extended intention list. Since the extended intention list includes the intention "I am so hot", the electronic device 100 can recognize that the detected voice matches the extended intention.
  • the electronic device 100 may confirm to the user whether to issue a voice command. Specifically, the electronic device 100 may display the user interface 510 shown in FIG. 5A according to the execution intention associated with the extended intention "I am so hot". User interface 510 may include prompt box 421.
  • the prompt content in the prompt box 421 can be used to guide the user to speak a voice that matches the execution intention associated with the extended intention "I am so hot".
  • the prompt content in the prompt box 421 may be: You can say "turn on the air conditioner" to me.
  • the user can speak "Turn on the air conditioner” according to the prompt of prompt box 421 shown in Figure 5A.
  • the electronic device 100 can detect the voice "turn on the air conditioner” in the environment.
  • the voice assistant in the electronic device 100 is still in sleep state.
  • the electronic device 100 can use a low-computing power speech recognition model to recognize that the speech matches the execution intention of "turn on the air conditioner".
  • the electronic device 100 can perform an operation corresponding to the execution intention, that is, calling a module for controlling the air conditioner in the car to turn on the air conditioner.
  • the electronic device 100 can also wake up the voice assistant and switch the voice assistant from the sleep state to the awake state.
  • the embodiment of the present application does not limit the manner in which the electronic device 100 confirms with the user whether to issue a voice command when it is recognized that the detected voice matches the extended intention.
  • in some embodiments, upon recognizing that the detected voice matches the extended intention, the electronic device 100 may still keep the voice assistant in the sleep state. It is understandable that when a user speaks a voice in the car that matches an extended intention (such as "I'm so hot"), it does not necessarily mean that the user is giving a voice command. If the user then follows the prompt of the electronic device 100 and speaks the corresponding voice (such as "turn on the air conditioner"), this can indicate that, in speaking the voice that matches the extended intention, the user was giving a voice command. If the user does not speak the corresponding voice according to the prompt of the electronic device 100, this may mean that the user was not giving a voice command when speaking the voice that matches the extended intention (for example, the user may have been chatting with other people).
  • in response to an ambiguous voice spoken by the user, the electronic device can also prompt the user to speak a more direct and unambiguous voice command (i.e., a voice that matches the execution intention associated with the extended intention in the extended intention list), so as to determine whether the user is issuing a voice command.
  • the above embodiment can prompt the user to speak a more direct statement when the user speaks a voice that matches an extended intention and actually wants to issue a voice command, so that the electronic device can perform the operation corresponding to the voice command the user wants to issue.
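  • the extended-intention handling in this scenario amounts to a lookup from ambiguous utterances to their associated execution intentions, plus a prompt instead of an immediate action. Below is a minimal Python sketch; all names and example strings are invented for illustration and are not this application's implementation:

```python
EXECUTION_INTENTIONS = {"turn on the air conditioner", "play song 1"}

# Each extended intention is associated with one or more execution intentions.
EXTENDED_INTENTIONS = {
    "i'm so hot": ["turn on the air conditioner"],
    "the volume is too low": ["turn up the system volume",
                              "turn up the navigation volume"],
}

def handle_voice_while_asleep(text: str):
    if text in EXECUTION_INTENTIONS:
        return ("execute_and_wake", text)        # direct, unambiguous command
    if text in EXTENDED_INTENTIONS:
        # Ambiguous: keep the assistant asleep and only prompt the user
        # to speak the associated execution intention.
        prompts = [f'You can say "{e}" to me.' for e in EXTENDED_INTENTIONS[text]]
        return ("prompt", prompts)
    return ("ignore", None)                      # likely chatting or noise

print(handle_voice_while_asleep("i'm so hot"))
print(handle_voice_while_asleep("turn on the air conditioner"))
```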
  • execution intentions can be divided into entity intentions (intentions containing an entity) and non-entity intentions. An entity intention can be composed of a sentence pattern and an entity.
  • the execution intention list can correspond to a sentence pattern list and an entity list.
  • when the electronic device 100 uses the low-computing power speech recognition model to recognize the received speech and obtains the text sequence with the highest probability, the electronic device 100 can determine whether the sentence pattern of the text sequence exists in the above sentence pattern list. If the sentence pattern of the text sequence exists in the sentence pattern list (that is, the voice spoken by the user hits the sentence pattern), the electronic device 100 can determine whether the entity in the text sequence exists in the above entity list.
  • specifically, the electronic device 100 can search the entity list, according to the entity category corresponding to the entity placeholder in the sentence pattern, for whether that category contains the entity in the text sequence.
  • in this way, the electronic device 100 does not need to compare the above text sequence with all entity intentions in the execution intention list one by one, which can simplify the process of identifying whether the detected voice matches an execution intention.
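  • the two-stage matching just described (first hit a sentence pattern, then look up the entity only within the category named by the placeholder) can be sketched as below. Encoding sentence patterns as regular expressions is an assumption of this sketch, not something this application specifies:

```python
import re

# Sentence patterns: a body structure plus an entity placeholder that names
# an entity category. More specific patterns are checked first.
SENTENCE_PATTERNS = [
    (re.compile(r"^play (?P<entity>.+)'s songs?$"), "singer_name"),
    (re.compile(r"^play (?P<entity>.+)$"), "song_title"),
    (re.compile(r"^navigate to (?P<entity>.+)$"), "place_name"),
]

# Entity list corresponding to the execution intention list, grouped by category.
ENTITY_LIST = {
    "song_title": {"song 1"},
    "singer_name": {"singer 1"},
    "place_name": {"place 1"},
}

def match_execution_intention(text: str):
    """Return 'hit' on a full match, 'hit_pattern_miss_entity' when the
    sentence pattern matches but the entity is unknown, or None otherwise."""
    for pattern, category in SENTENCE_PATTERNS:
        m = pattern.match(text)
        if not m:
            continue                       # this sentence pattern missed
        entity = m.group("entity")
        # Search only the entity category named by the placeholder, instead of
        # comparing against every entity intention one by one.
        if entity in ENTITY_LIST[category]:
            return ("hit", category, entity)
        return ("hit_pattern_miss_entity", category, entity)
    return None

print(match_execution_intention("play song 1"))   # ('hit', 'song_title', 'song 1')
print(match_execution_intention("play song 2"))   # pattern hit, entity miss
```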
  • the prefix and suffix list may include prefixes such as "please", "help me", and "please help me", and suffixes such as "le" (了) and "ba" (吧).
  • the embodiments of this application do not specifically limit the above prefixes and suffixes.
  • the electronic device 100 can use the low-computing power speech recognition model to recognize the detected speech and obtain the text sequence with the highest probability. The electronic device 100 can determine whether the text sequence contains a prefix or suffix from the above prefix and suffix list. If so, the electronic device 100 can remove the prefix and/or suffix from the text sequence and then compare the remainder with the execution intentions.
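  • this normalization can be sketched as stripping a known courtesy prefix and/or sentence-final particle before comparison; the space-delimited English stand-ins below are this sketch's own simplification:

```python
PREFIXES = ["please help me", "help me", "please"]   # longest first
SUFFIXES = ["le", "ba"]                              # particles such as 了 / 吧

def strip_prefix_suffix(text: str) -> str:
    """Remove a known prefix and/or suffix so that the remainder can be
    compared against the execution intentions."""
    for p in PREFIXES:
        if text.startswith(p + " "):
            text = text[len(p) + 1:]
            break
    for s in SUFFIXES:
        if text.endswith(" " + s):
            text = text[: -(len(s) + 1)]
            break
    return text

print(strip_prefix_suffix("please help me play song 1 ba"))  # -> "play song 1"
```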
  • the electronic device 100 may also use the low-computing power speech recognition model to identify whether the detected voice matches an extended intention. For the specific method, please refer to the above method of identifying whether the detected voice matches an execution intention, which is not repeated here.
  • Figure 6 exemplarily shows a flowchart of a voice interaction method. As shown in Figure 6, the method may include steps S611 to S624. Specifically:
  • the electronic device 100 can collect sounds in the surrounding environment in real time through a microphone. When the user speaks near the electronic device 100, the electronic device 100 can detect voice, such as voice 1, in the collected sounds through the processor.
  • the electronic device 100 can determine whether voice 1 contains the wake-up word. The electronic device 100 can use a wake-up speech recognition model to determine whether voice 1 contains the wake-up word.
  • the above wake-up speech recognition model and the low computing power speech recognition model in this application may be the same model, or they may be different models.
  • the embodiment of the present application does not limit the specific method of determining whether speech 1 contains a wake-up word.
  • the electronic device 100 may perform the following step S613.
  • the electronic device 100 may perform the following step S616.
  • the electronic device 100 can collect sounds in the surrounding environment through the microphone, detect the voice contained in the collected sounds through the processor, and identify the user's voice in the detected voice through a high-computing power speech recognition model. Intent, and perform the operation corresponding to the intent.
  • the above-mentioned voice 1 is the voice corresponding to the user saying the wake-up word (such as "Xiaoyi Xiaoyi").
  • after the electronic device 100 recognizes that voice 1 contains the wake-up word, it can wake up the voice assistant.
  • after detecting voice 1, the electronic device 100 also detects the voice corresponding to the user's voice command "open the car window".
  • the electronic device 100 can run a high-computing speech recognition model to identify the intention corresponding to the detected voice command (ie, open the car window). Then, the electronic device 100 can call the module that controls the vehicle window to open the vehicle window.
  • the electronic device 100 can also wake up the voice assistant in response to other wake-up operations (such as operations on physical buttons or virtual buttons).
  • the user can still wake up the voice assistant first through the wake-up operation, and then issue a voice command to the electronic device 100 .
  • the electronic device 100 can recognize the voice command through the voice assistant and perform operations corresponding to the voice command.
  • after the electronic device 100 wakes up the voice assistant, it can detect in real time whether the user performs voice interaction with the electronic device 100. If voice interaction between the user and the electronic device 100 is detected, the electronic device 100 can keep the voice assistant in the awake state, recognize the voice command, and perform the operation corresponding to the voice command. If no voice interaction between the user and the electronic device 100 is detected within a preset time period, the electronic device 100 can switch the voice assistant to the sleep state, thereby saving power consumption of the electronic device 100.
  • the above-mentioned preset time period may be, for example, 1 minute, 2 minutes, etc. The embodiment of the present application does not limit the value of the preset time period.
  • the electronic device 100 can keep the voice assistant in the awake state. In this way, the user does not need to perform frequent wake-up operations when performing multiple rounds of voice interaction with the electronic device 100 .
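  • the awake-to-sleep timeout can be sketched as a simple timestamp check; the 60-second value below is just one of the example durations mentioned above (1 minute, 2 minutes, etc.), and the class layout is invented for illustration:

```python
import time

PRESET_TIMEOUT_S = 60.0   # e.g. 1 minute; the embodiments leave the value open

class VoiceAssistant:
    def __init__(self):
        self.state = "sleep"
        self._last_interaction = 0.0

    def wake(self):
        self.state = "awake"
        self.touch()

    def touch(self):
        # Called whenever voice interaction with the user is detected.
        self._last_interaction = time.monotonic()

    def tick(self):
        # Called periodically: if no voice interaction occurred within the
        # preset time period, fall back to sleep to save power consumption.
        if (self.state == "awake"
                and time.monotonic() - self._last_interaction > PRESET_TIMEOUT_S):
            self.state = "sleep"
```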
  • the electronic device 100 can determine whether the voice assistant is currently in the wake state.
  • the electronic device 100 may perform the above-mentioned step S614. Specifically, the electronic device 100 can recognize the user's intention in the voice 1 through a high-computing speech recognition model, and perform operations corresponding to the intention.
  • the electronic device 100 may perform the following step S617.
  • the voice assistant is not in the awake state, that is, the voice assistant is in the sleep state.
  • the implementation method for the electronic device 100 to determine whether the voice 1 matches the execution intention may refer to the introduction of the foregoing embodiments.
  • the above-mentioned voice 1 may be the voice corresponding to the user saying "play song 1".
  • the execution intention list contains the execution intention "Play Song 1".
  • voice 1 is detected, the voice assistant in the electronic device 100 is in a sleep state.
  • the electronic device 100 can use a low-computing power speech recognition model to determine that speech 1 matches the execution intention "play song 1". Then, the electronic device 100 can perform the operation corresponding to the execution intention "play song 1", that is, start playing song 1.
  • the above-mentioned voice 1 can be the voice corresponding to "I am so hot” spoken by the user.
  • the list of extended intentions includes extended intention 1 "I'm so hot”.
  • the execution intention list includes execution intention 1 "turn on the air conditioner”.
  • voice 1 is detected, the voice assistant in the electronic device 100 is in a sleep state.
  • the electronic device 100 can use a low-computing power speech recognition model to determine that speech 1 matches the above-mentioned extended intention 1.
  • the electronic device 100 may prompt the user to say execution intention 1, that is, prompt the user to say "turn on the air conditioner".
  • the electronic device 100 can display a prompt box 421 on the screen.
  • the electronic device 100 can keep the voice assistant in a sleep state.
  • the following takes voice 1 being "I'm so hot", extended intention 1 being "I'm so hot", and execution intention 1 being "turn on the air conditioner" as an example for explanation. If, after the above step S621, the user says "turn on the air conditioner" according to the prompt of the electronic device 100, it can mean that the user said the above voice 1 because he wanted the electronic device 100 to turn on the air conditioner. Afterwards, the electronic device 100 can turn on the air conditioner after detecting the voice matching execution intention 1. This can reduce the possibility of missing voice commands given by the user.
  • if, after the above step S621, the user ignores the prompt of the electronic device 100 and does not say "turn on the air conditioner", it can mean that the user's utterance of the above voice 1 was not a voice command. Afterwards, the electronic device 100 continues to keep the voice assistant in the sleep state and does not turn on the air conditioner. This reduces instances of mistakenly responding to voices spoken by the user that are not voice commands.
  • the above-mentioned execution intention list and extended intention list can reduce the impact of user chatting sounds and environmental noise on the electronic device 100 accurately recognizing voice commands without waking up the voice assistant.
  • the electronic device 100 can quickly respond to the user's voice instructions without waking up the voice assistant.
  • the user can issue voice commands at any time to instruct the electronic device 100 to perform corresponding operations without waking up the voice assistant.
  • the electronic device 100 can also wake up the voice assistant in addition to performing the operation corresponding to the execution intention. In this way, the electronic device 100 can more accurately recognize the user's subsequent requests after the voice assistant wakes up, so as to accept the user's voice control.
  • the electronic device 100 can switch the state of the voice assistant between the sleep state and the awake state according to the voice interaction between the user and the electronic device 100 . This can save the power consumption of the electronic device 100 as much as possible while providing the user with a full-time wake-up-free experience.
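  • putting the branches of Figure 6 together, the dispatch between the sleep state (low-computing power model plus the two lists) and the awake state (high-computing power model) can be summarized in one handler. The sketch below is an illustrative skeleton with stubbed recognizers, not this application's implementation; the wake word "xiaoyi xiaoyi" is taken from the example above:

```python
EXECUTION_INTENTIONS = {"play song 1", "turn on the air conditioner"}
EXTENDED_TO_EXECUTION = {"i'm so hot": "turn on the air conditioner"}
WAKE_WORD = "xiaoyi xiaoyi"

class Assistant:
    state = "sleep"

def low_power_recognize(voice: str) -> str:
    return voice.lower().strip()     # stand-in for the low-compute model

def high_power_recognize(voice: str) -> str:
    return voice.lower().strip()     # stand-in for the high-compute model

def perform(intention: str):
    print("performing:", intention)

def on_voice(assistant: Assistant, voice: str):
    text = low_power_recognize(voice)
    if WAKE_WORD in text:                        # S612 -> S613: wake word heard
        assistant.state = "awake"
    elif assistant.state == "awake":             # S616 -> S614: free-form intention
        perform(high_power_recognize(voice))
    elif text in EXECUTION_INTENTIONS:           # S617: execution intention hit
        perform(text)                            # respond immediately...
        assistant.state = "awake"                # ...and wake for follow-up commands
    elif text in EXTENDED_TO_EXECUTION:          # extended intention hit
        print(f'You can say "{EXTENDED_TO_EXECUTION[text]}" to me.')  # prompt (S621)
        # the assistant stays asleep until the execution intention is heard
    # otherwise: ignore the voice and stay asleep

a = Assistant()
on_voice(a, "I'm so hot")                    # prompts for "turn on the air conditioner"
on_voice(a, "turn on the air conditioner")   # executes and wakes the assistant
```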
  • the electronic device 100 can also adjust the execution intention list through self-learning during the voice interaction process, so that the intentions contained in the execution intention list are closer to the user's common voice commands, thereby improving the user's experience of voice interaction with the electronic device 100.
  • Figures 7A and 7B exemplarily illustrate a voice interaction scenario provided by embodiments of the present application.
  • the voice assistant in the electronic device 100 may be in a sleep state.
  • the user issues a voice command "play song 2" to the electronic device 100 in the car.
  • the execution intention list stored by the electronic device 100 includes the execution intention "Play Song 1", but does not include "Play Song 2". That is to say, the sentence pattern list corresponding to the execution intention list includes the sentence pattern "play [song title]".
  • the entity list corresponding to the execution intention list contains "Song 1" but not "Song 2".
  • the above-mentioned song 1 may be a popular song determined based on statistical data.
  • the above-mentioned song 2 may be a non-popular song determined based on statistical data. Whether a song is a hit can be determined by the song's on-demand rate.
  • the title of the above-mentioned popular song can be preset in the entity list.
  • the electronic device 100 can add "Song 1" to the entity list without adding "Song 2" to the entity list.
  • the embodiment of this application does not limit the specific content included in the entity list.
  • the entity list includes "Song 1" but does not include “Song 2" as an example for description.
  • the electronic device 100 can voice broadcast "I'm sorry, I didn't hear it clearly, please say it again.”
  • the embodiment of the present application does not limit the method by which the electronic device 100 instructs the user to repeat the voice.
  • the electronic device 100 can display the user interface 710 shown in FIG. 7A, on which the wake-up indicator 412 is displayed.
  • the user says “Play Song 2" again according to the instruction of the electronic device 100 shown in FIG. 7A .
  • the electronic device 100 may detect the voice "Play song 2" in the environment.
  • the voice assistant is currently awake.
  • the electronic device 100 can use a high-computing speech recognition model to recognize the user's intention in the voice, and perform operations corresponding to the intention.
  • the electronic device 100 may announce by voice "Okay, playing song 2 for you" and start playing song 2.
  • the electronic device 100 may also display the user interface 720 shown in Figure 7B.
  • the user interface 720 may include a voice broadcast component 711 and a song play component 712.
  • the above-mentioned voice broadcast component 711 may display the content of the voice broadcast when the electronic device 100 performs voice interaction with the user.
  • the above-mentioned song playing component 712 please refer to the aforementioned introduction of the song playing component 411 shown in FIG. 4B.
  • when the electronic device 100 detects the voice "Play Song 2" again while the voice assistant is in the sleep state, since the entity "Song 2" has by then been added to the entity list corresponding to the execution intention list, the electronic device 100 can use the low-computing power speech recognition model to determine that the voice matches an execution intention in the execution intention list, so that it can directly respond to the user's voice command and start playing song 2 without waking up the voice assistant.
  • when the voice assistant is in the awake state, if the voice detected by the electronic device 100 hits a sentence pattern in the sentence pattern list corresponding to the execution intention list but misses the entities in the entity list corresponding to the execution intention list, the electronic device 100 may add the entity in the voice to the entity list corresponding to the execution intention list.
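  • a minimal sketch of this entity self-learning rule, reusing the regular-expression encoding assumed in the earlier matching sketch:

```python
import re

SENTENCE_PATTERN = (re.compile(r"^play (?P<entity>.+)$"), "song_title")
ENTITY_LIST = {"song_title": {"song 1", "song 2"}}

def self_learn_entity(recognized_text: str):
    """While the voice assistant is awake: if the recognized voice hits the
    sentence pattern but its entity is missing from the entity list, add it."""
    regex, category = SENTENCE_PATTERN
    m = regex.match(recognized_text)
    if m and m.group("entity") not in ENTITY_LIST[category]:
        ENTITY_LIST[category].add(m.group("entity"))   # e.g. learn "song 3"

self_learn_entity("play song 3")
print(ENTITY_LIST)   # "play song 3" will now match while the assistant sleeps
```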
  • for example, after the electronic device 100 plays song 2, the user wants to switch the played song to song 3.
  • the user can speak the voice "Play song 3". Since the voice assistant is still awake, the electronic device 100 can recognize the user's intention in the voice through a high-computing speech recognition model, and perform an operation corresponding to the intention (ie, play song 3).
  • if the execution intention list does not include the execution intention "Play Song 3" (that is, the entity list corresponding to the execution intention list does not contain the entity "Song 3"), the electronic device 100 can add the entity "Song 3" in the voice to the entity list corresponding to the execution intention list.
  • the entity "Song 3" can belong to the entity of the song title class in the entity list.
  • the user can implement voice control of the electronic device 100 under the instruction of the electronic device 100 .
  • the user does not need to perform a wake-up operation.
  • the electronic device 100 can add an execution intention that matches a voice command issued by the user to the execution intention list through self-learning during the voice interaction process. In this way, when the electronic device 100 subsequently detects the same voice command again, it can quickly respond to the user's voice command without waking up the voice assistant. In other words, when the user issues the same voice command again later, he does not need to perform the wake-up operation.
  • FIG. 8 schematically illustrates a method for the electronic device 100 to self-learn to adjust the execution intention list provided by the embodiment of the present application.
  • the list of sentence patterns may include the following sentence patterns: "Play [song name]", “Play [singer name]'s song”, “Navigate to [place name]”.
  • the entity list may include a song title type entity "Song 1", a singer type entity “Singer 1”, and a place name entity "Place 1".
  • the execution intention "Play Song 2" is not included in the above execution intention list. That is, the entity “Song 2" is not included in the entity list.
  • the voice assistant in the electronic device 100 is in a sleep state.
  • the electronic device 100 can use the low-computing power speech recognition model to determine that the sentence pattern of the voice matches the sentence pattern "play [song title]" in the sentence pattern list corresponding to the execution intention list, and that the entity of this voice does not match any entity in the entity list corresponding to the execution intention list. In other words, the voice in S81 above hits the sentence pattern but misses the entity.
  • the electronic device 100 instructs the user to repeat the voice, wakes up the voice assistant, and runs the high-computing power speech recognition model.
  • the entity intention "Play Song 2" is added to the execution intention list.
  • the entity “Song 2” is added to the entity list corresponding to the execution intention list.
  • the electronic device 100 can then quickly respond to the user's voice command "Play Song 2" when the voice assistant is in the sleep state.
  • in some embodiments, the electronic device 100 may cache detected speech. When it is determined that the detected voice hits the sentence pattern but does not hit the entity, the electronic device 100 can obtain the voice from the storage module. Then, after the voice assistant wakes up, the electronic device 100 can use the high-computing power speech recognition model to recognize the user's intention in the voice and perform the operation corresponding to the intention. This can save the user from having to speak the same voice command again, improving the user's voice interaction experience with the electronic device 100.
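  • caching the missed utterance so the user need not repeat it can be sketched as below; the one-slot cache and the stub functions are invented for illustration:

```python
from collections import deque

pending_voice = deque(maxlen=1)    # most recent "pattern hit, entity miss" audio

def high_power_recognize(audio: bytes) -> str:
    return "play song 2"           # stand-in for the high-compute model

def perform(intention: str):
    print("performing:", intention)

def on_pattern_hit_entity_miss(audio: bytes):
    pending_voice.append(audio)    # cache the audio instead of discarding it
    # ... wake the voice assistant here ...

def on_assistant_awake():
    if pending_voice:
        audio = pending_voice.pop()
        perform(high_power_recognize(audio))   # no need to ask the user to repeat

on_pattern_hit_entity_miss(b"<audio of 'play song 2'>")
on_assistant_awake()
```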
  • the self-learning shown in Figure 8 can also be called entity self-learning. It is understandable that the sentence pattern of an entity intention can support placing any entity under the same entity category at the location of the entity placeholder.
  • the electronic device 100 can detect which entities are included in the voice instructions given by the user while the user is using the voice assistant, and add the entities mentioned by the user to the entity list. This allows the user to quickly perform operations corresponding to the user's frequently used voice commands without performing a wake-up operation, such as playing songs that the user often listens to, navigating to places that the user often visits, etc.
  • the above self-learning method of adjusting the execution intention list can make the intentions contained in the execution intention list closer to the user's common voice commands, thereby improving the user's voice interaction experience with the electronic device.
  • FIG. 9 schematically illustrates a method for the electronic device 100 to self-learn to adjust the execution intention list provided by the embodiment of the present application.
  • the extended intention list currently stored by the electronic device 100 includes the following extended intentions: “I am so hot” and "The volume is too low”.
  • the user speaks the voice "I'm so hot” near the electronic device 100 .
  • the voice assistant in the electronic device 100 is in a sleep state.
  • the electronic device 100 may utilize a low-computing power speech recognition model to recognize that the detected speech matches the extended intention "I am so hot”.
  • the electronic device 100 may prompt the user to say the execution intention "turn on the air conditioner” associated with the extended intention "I am so hot”.
  • the electronic device 100 can call the module that controls the air conditioner to turn on the air conditioner.
  • in some embodiments, one or more execution intentions in the execution intention list can be moved to the extended intention list, so that the one or more execution intentions are adjusted to extended intentions.
  • the electronic device 100 may detect the frequency of withdrawing or canceling the above-mentioned executed operation in response to the user's operation within a preset time period after executing an operation corresponding to the execution intention. If the frequency is higher than the preset frequency, it means that the misrecognition rate of the above execution intention is high.
  • in this case, the electronic device 100 can adjust the above-mentioned execution intention to an extended intention.
  • the operator of the above-mentioned voice assistant may find that one or more execution intentions have a high misrecognition rate during the testing process of the above-mentioned low-computing power speech recognition model.
  • the operator of the above-mentioned voice assistant can collect user feedback and determine based on the user feedback that one or more execution intentions have a high misrecognition rate.
  • the embodiments of this application do not limit the implementation method of determining whether the execution intention has a high misrecognition rate.
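  • the undo-frequency heuristic for demoting an execution intention can be sketched as follows; the window, threshold, and minimum sample count are invented values, since the embodiments deliberately leave them open:

```python
import time
from collections import defaultdict

UNDO_WINDOW_S = 10.0        # "preset time period" after execution (assumed value)
UNDO_RATE_THRESHOLD = 0.3   # "preset frequency" (assumed value)
MIN_SAMPLES = 5             # avoid demoting on too little evidence (assumed)

execution_list = {"turn on the air conditioner"}
extended_list = set()
stats = defaultdict(lambda: {"executed": 0, "undone": 0, "last_exec": 0.0})

def record_execution(intention: str):
    s = stats[intention]
    s["executed"] += 1
    s["last_exec"] = time.monotonic()

def record_user_undo(intention: str):
    s = stats[intention]
    if time.monotonic() - s["last_exec"] <= UNDO_WINDOW_S:
        s["undone"] += 1
    # A high undo rate suggests the intention is frequently misrecognized:
    if (s["executed"] >= MIN_SAMPLES
            and s["undone"] / s["executed"] > UNDO_RATE_THRESHOLD):
        execution_list.discard(intention)
        extended_list.add(intention)   # demote: confirm with the user next time
```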
  • each user interface described in the embodiment of the present application is only an example interface and does not limit the solution of the present application.
  • the user interface can adopt different interface layouts, can include more or fewer controls, and can add or remove other functional options. Any design based on the same inventive concept provided by this application falls within the scope of protection of this application.


Abstract

A voice interaction method and related apparatus. When the voice assistant is not awakened, the electronic device (100) can identify whether a detected voice matches a preset intention. If it matches, the electronic device (100) can perform the operation corresponding to the matched intention and wake up the voice assistant. After waking up the voice assistant, the electronic device (100) can respond more accurately to the user's subsequent requests. If there is no voice interaction within a preset time period after the voice assistant is awakened, the electronic device (100) can switch the voice assistant from the awake state to the sleep state. In the above method, the electronic device (100) can quickly respond to the user's requests while the voice assistant is in the sleep state. The user can issue voice commands at any time to instruct the electronic device to perform corresponding operations without waking up the voice assistant.

Description

语音交互方法及相关装置
本申请要求于2022年06月25日提交中国专利局、申请号为202210728191.8、申请名称为“语音交互方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及终端技术领域,尤其涉及语音交互方法及相关装置。
背景技术
目前越来越多的设备可以提供语音交互功能,方便用户通过语音来控制设备。例如,用户可以向设备下达语音指令“播放音乐”。设备可以在识别出该语音指令后,播放音乐。但用户每次向设备下达语音指令时,都需要先通过唤醒词唤醒设备中的语音交互应用,然后再说出语音指令。这就导致用户与设备进行语音交互的过程不流畅,用户需要频繁说唤醒词来实现语音控制设备的目的,用户体验较差。
发明内容
本申请提供语音交互方法及相关装置。上述方法可以在节约电子设备功耗的基础上,给用户带来全时免唤醒的语音交互体验。用户可以无需唤醒语音助手,随时下达语音指令指示电子设备执行相应的操作。
第一方面,本申请提供一种语音交互方法。该方法应用于电子设备。电子设备包含语音助手。其中,电子设备可以在语音助手处于睡眠态的情况下,接收第一语音。电子设备可以确定第一语音与第一列表中的第一意图匹配,第一列表中包含一个或多个语音指令对应的意图。电子设备可以执行第一意图对应的操作。电子设备可以唤醒语音助手。在语音助手处于唤醒态的情况下,电子设备可以接收第二语音。电子设备可以识别第二语音中的第二意图,执行第二意图对应的操作。
由上述方法可知,在未唤醒语音助手的情况下,电子设备可以实时检测用户说出的语音是否与第一列表中的意图匹配。当检测到用户说出的语音与第一列表中的意图匹配,电子设备可以直接执行该意图对应的操作。其中,第一列表包含与语音指令对应的意图。也即是说,用户可以直接向电子设备下达与第一列表中的意图对应的语音指令,而无需先唤醒语音助手。且在下达语音指令后,电子设备除了执行该语音指令对应的操作,还可以唤醒语音助手。这样,用户还可以进一步向电子设备下达更多的语音指令,从而在不进行唤醒操作的情况下与电子设备进行多轮语音交互。
结合第一方面,在一些实施例中,上述第一列表可以是本申请中的执行意图列表。第一列表中包含的意图可以称为执行意图。第一列表可包含用户常用语音指令对应的意图。上述常用语音指令可以包括使用频率高、误识率低、没有歧义的语音指令。上述误识率可以指将用户说出的不包含语音指令的语音误识别为语音指令的概率。这样可以方便用户在不进行唤醒操作的情况下,直接下达常用语音指令来控制电子设备执行相应的操作。
结合第一方面,在一些实施例中,第一语音和第二语音均不包含用于唤醒语音助手的唤醒词。
结合第一方面,在一些实施例中,电子设备可包含第一语音识别模型和第二语音识别模型。其中,第二语音识别模型的大小大于第一语音识别模型的大小。上述第一语音识别模型的大小和第二语音识别模型的大小可以指语音识别模型所需要的存储空间的大小。语音识别模型的大小越大,可以表示语音识别模型的算力越高。算力可以表示语音识别模型处理、运算数据的能力。即第二语音识别模型的算力高于第一语音识别模型的大小。语音识别模型的算力越低,语音识别模型的功耗越低,所需的计算资源越少。也即是说,在同样的运行条件下,第二语音识别模型的功耗高于第一语音识别模型的功耗。第二语音识别模型所需的计算资源多于第一语音识别模型所需的计算资源。其中,语音识别模型的算力越低,语音识别模型所使用的参数量可能更少。即第二语音识别模型所使用的参数量多于第一语音识别模型所使用的参数量。
在语音助手处于睡眠态的情况下,电子设备可以实时运行第一语音识别模型。其中,电子设备可以利用第一语音识别模型确定第一语音与第一列表中的第一意图匹配。
当唤醒语音助手,在语音助手处于唤醒态的情况下,电子设备可以运行第二语音识别模型。其中,电子设备可以利用第二语音识别模型识别第二语音中的第二意图。电子设备利用第二语音识别模型识别接收到的语音中的意图时,无需使用上述第一列表。
由于第一语音识别模型的功耗低,电子设备在未唤醒语音助手的情况下实时运行低算力语音识别模型通常不会产生过高的功耗,从而也不会导致电子设备发热、运行卡顿等问题。电子设备在语音助手未被唤醒时持续运行低算力语音识别模型,可以实现用尽可能少的功耗,给用户带来全时免唤醒的体验。当识别出检测到的语音与第一意图匹配,电子设备除了执行第一意图对应的操作,还可以唤醒语音助手。这样,电子设备在语音助手唤醒后可以更加准确地是识别用户后续的请求,给用户提供更好的语音交互体验。
结合第一方面,在一些实施例中,唤醒语音助手之后,电子设备还可以在第一时间段内未接收到语音的情况下,将语音助手从唤醒态切换到睡眠态。
其中,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次接收到语音的时刻为起始时刻,时长为预设时长(如5秒、10秒等等)的一段时间。或者,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次从接收到的语音中识别出语音指令的时刻为起始时刻,时长为预设时长的一段时间。或者,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次响应接收到的语音指令执行相应操作的时刻为起始时刻,时长为预设时长的一段时间。
例如,电子设备在接收到上述第二语音之后的一段时间内未检测到环境中有语音。那么,上述第一时间段可以是以电子设备接收到第二语音的时刻为起始时刻,时长为预设时长的一段时间。再例如,电子设备在执行上述第二语音中的第二意图对应的操作后未检测到环境中有语音。那么,上述第一时间段可以是以电子设备执行完成第二意图对应的操作的时刻为起始时刻,时长为预设时长的一段时间。
上述实施例可以避免用户在语音助手被唤醒后没有下达语音指令的情况下,电子设备长时间运行高算力语音识别模型功耗过高,从而节约电子设备的功耗。
结合第一方面,在一些实施例中,第一列表对应第一句式列表和第一实体列表,第一句式列表包含一个或多个句式,第一实体列表包含一个或多个实体,第一列表中的一个或多个意图由第一句式列表中的句式与第一实体列表中的实体组成。电子设备可以在语音助手处于睡眠态的情况下,接收第三语音。电子设备可以确定第三语音的句式与第一句式列表中的第一句式匹配,且第一实体列表中没有与第三语音的第一实体匹配的实体。电子设备可以唤醒语音助手。在语音助手处于唤醒态的情况下,电子设备可以识别第三语音中的第三意图,并执行第三意图对应的操作,第三意图由第一句式和第一实体组成。
其中,第一列表中的意图可以根据意图中有无实体划分为有实体意图和无实体意图。实体可以指一种事物类别下的具体实例。例如,实体对应的事物类别可以包括以下一项或多项:歌名、歌手名、地点名、电影名、电视剧名、图书名、火车车次、航班号、电话号码、邮箱等等。上述实体对应的事物类别也可称为实体类别。有实体意图即为包含实体的意图。其中,有实体意图可以由句式和实体组成。句式可以包含句式主体结构和实体占位符。实体占位符用于确定句式中用于放置实体的位置。有实体意图的句式可以支持在实体占位符所在的位置放置同一种事物类别下的任意实体。无实体意图即为不包含实体的意图。
在语音助手处于睡眠态,电子设备可以利用第一语音识别模型确定第三语音的句式与第一句式列表中的第一句式匹配,且第一实体列表中没有与第三语音的第一实体匹配的实体。然后,在语音助手处于唤醒态,电子设备可以用第二语音识别模型识别第三语音中的第三意图。
由上述实施例可知,即便用户在下达语音指令时,说出的语音与第一列表当前包含的意图不匹配,电子设备仍可以对该语音进行响应,来执行用户下达语音指令对应的操作。上述方法可以更好地为用户提供全时免唤醒的语音交互体验。
结合第一方面,在一些实施例中,当确定出上述第三语音的句式与第一句式列表中的第一句式匹配,且第一实体列表中没有与第三语音的第一实体匹配的实体,电子设备可以提示用户重复上述第三语音(例如,电子设备可以语音播报“我没听清,请再说一遍”),并唤醒语音助手。用户可以根据电子设备的提示重复上述第三语音。在语音助手处于唤醒态的情况下,电子设备可以接收到用户重复上述第三语音的语音,并利用第二语音识别模型对该语音进行识别,识别出该语音中的第三意图。然后,电子设备可以执行该第三意图对应的操作。
结合第一方面,在一些实施例中,当确定出上述第三语音的句式与第一句式列表中的第一句式匹配,且第一实体列表中没有与第三语音的第一实体匹配的实体,电子设备还可以在第一实体列表添加第三语音的第一实体。这样,当用户再次说出与第三语音相同的语音,电子设备可以在语音助手处于睡眠态时,利用第一语音识别模型确定出该语音与第一列表中的意图匹配,从而直接执行该意图对应的操作。
可以看出,电子设备在进行语音交互的过程中还可以通过自学习对第一实体列表进行调整,使得第一实体列表中包含更多用户常用的实体,从而使得第一列表中包含的意图更贴近用户的常用语音指令对应,提升用户与电子设备进行语音交互的使用体验。
结合第一方面,在一些实施例中,电子设备在第一实体列表添加第三语音的第一实体之后,在语音助手处于睡眠态的情况下,接收第四语音。电子设备可以确定第四语音的句式与第一句式列表中的第一句式匹配,且第四语音的实体与第一实体列表中的第一实体匹配,其中,第四语音与第三意图匹配。电子设备可以执行第三意图对应的操作。电子设备可以唤醒语音助手。
可以看出,经过自学习,电子设备可以在第一实体列表添加第一实体。其中,上述第一实体和第一句式可以组成上述第三意图。那么,电子设备在第一实体列表添加第一实体,可以相当于在第一列表中添加了上述第三意图。这样,用户可以直接向电子设备下达与第三意图对应的语音指令,而无需先唤醒语音助手。且在下达与第三意图对应的语音指令,电子设备除了执行该第三意图对应的操作,还可以唤醒语音助手。用户还可以进一步向电子设备下达更多的语音指令,从而在不进行唤醒操作的情况下与电子设备进行多轮语音交互。
结合第一方面,在一些实施例中,电子设备可以在语音助手处于睡眠态的情况下,接收第五语音。电子设备可以确定第五语音与第二列表中的第四意图匹配,第二列表中的一个意图与第一列表中的一个或多个意图关联,其中,第四意图与第一列表中的第五意图关联。电子设备可以提供第一提示,第一提示用于提示用户说出与第五意图匹配的语音。
其中,上述第二列表可以是本申请实施例中的扩展意图列表。第二列表中的意图可以称为扩展意图。第二列表可包含用户在表达常用语音指令时所说的非直接、有较高误识率的语音对应的意图。电子设备可以根据第二列表,检测接收到的语音是否与第二列表中的扩展意图匹配。用户说出的语音与扩展意图匹配可以表示用户说出的语音存在疑义。当检测到接收到的语音与扩展意图匹配,电子设备可以根据扩展意图关联的执行意图,向用户提供上述第一提示,以便确认用户是否想要实现与该扩展意图关联的执行意图。在确定用户想要实现与该扩展意图关联的执行意图后,电子设备可以进行该执行意图对应的操作,与用户进行语音交互。上述实施例可以在不唤醒语音助手的情况下,实现既不会漏识别用户可能下达的语音指令,也不会对用户所说的非语音指令的语音误响应,提高用户的语音交互体验。
结合第一方面,在一些实施例中,电子设备提供第一提示之后,还接收第六语音。电子设备可以确定第六语音与第五意图匹配,执行第五意图对应的操作。电子设备可以唤醒语音助手。
可以看出,通过扩展意图列表,电子设备还可以在响应用户说出的存在疑义的语音,提示用户说出更加直接且毫无疑义的语音指令(即与扩展意图关联的执行意图匹配的语音),从而确定用户是否下达语音指令。用户根据上述第一提示说出与执行意图匹配的语音,可以表示用户想要下达语音指令。这样,电子设备可以在执行用户下达的语音指令对应的操作。上述实施例可以在不唤醒语音助手的情况下,减少漏识别用户可能下达的语音指令的情况,提高用户的语音交互体验。
结合第一方面,在一些实施例中,电子设备提供第一提示之后,在第二时间段内未接收到与第五意图匹配的语音,电子设备可以取消第一提示,保持语音助手处于睡眠态。
其中,上述第一提示可以在电子设备的用户界面中显示第五意图对应的文字信息。电子设备取消第一提示可以为在用户界面上取消显示第五意图对应的文字信息。或者,上述第一提示可以为通过语音播报提示用户说出与第五意图匹配的语音。电子设备取消第一提示可以为停止语音播报提示用户说出与第五意图匹配的语音。
上述第二时间段可以是以电子设备提供第一提示的时刻为起始时刻,时长为预设时长的一段时间。
可以看出,用户说出与第二列表中的第四意图匹配的第五语音后,未在上述第一提示下说出更加直接且毫无疑义的语音指令。那么,用户说出上述第五语音可能并不是想下达语音指令(例如可能是在与他人聊天时说出上述第五语音)。那么,电子设备可以保持语音助手处于睡眠状态。上述实施例可以在不唤醒语音助手的情况下,减少对用户所说的非语音指令的语音误响应的情况,并且上述第一提示并不会对用户产生过多干扰,这可以提高用户的语音交互体验。
结合第一方面,在一些实施例中,第一列表包括第六意图。当确定第六意图的误识率高于第一阈值,电子设备可以在第一列表中移除第六意图,并在第二列表中添加第六意图。
可以看出,将上述误识率较高的第六意图从第一列表移动至第二列表后,电子设备检测到与第六意图匹配的语音时可以先向用户确认是否下达语音指令。在确认用户是下达语音指令的情况,电子设备可以执行与第六意图对应的操作。上述方法可以减少在不唤醒语音助手而下达语音指令的场景中,将非语音指令的语音当做语音指令而导致的误识别情况,提升用户与电子设备进行语音交互的使用体验。
第二方面,本申请提供一种语音交互方法,该方法应用于电子设备。电子设备包含语音助手。其中,电子设备在语音助手处于睡眠态的情况下,接收第一语音。响应于第一语音,电子设备可以提供第一提示, 第一提示用于提示用户说出第一指令。电子设备可以接收第二语音,并确定第二语音与第一指令匹配,执行第一指令对应的操作。
由上述方法可知,在未唤醒语音助手的情况下,电子设备可以实时检测用户说出的语音是否与预设的指令关联,并在用户说出的语音与预设的指令关联的情况下,提示用户说出上述预设的指令,从而执行上述预设的指令对应的操作。也即是说,在上述方法中,用户可以直接向电子设备下达语音指令,而无需先唤醒语音助手,这可以提高用户的语音交互体验。
结合第二方面,在一些实施例中,第一语音和第二语音均不包含用于唤醒语音助手的唤醒词。
结合第二方面,在一些实施例中,上述第一提示可以为在电子设备的用户界面中显示第一指令对应的文字信息。或者,上述第一提示可以为通过语音播报提示用户说出与第一指令匹配的语音。
结合第二方面,在一些实施例中,电子设备可包含第一语音识别模型和第二语音识别模型。其中,第二语音识别模型的大小大于第一语音识别模型的大小。上述第一语音识别模型的大小和第二语音识别模型的大小可以指语音识别模型所需要的存储空间的大小。语音识别模型的大小越大,可以表示语音识别模型的算力越高。算力可以表示语音识别模型处理、运算数据的能力。即第二语音识别模型的算力高于第一语音识别模型的大小。语音识别模型的算力越低,语音识别模型的功耗越低,所需的计算资源越少。也即是说,在同样的运行条件下,第二语音识别模型的功耗高于第一语音识别模型的功耗。第二语音识别模型所需的计算资源多于第一语音识别模型所需的计算资源。其中,语音识别模型的算力越低,语音识别模型所使用的参数量可能更少。即第二语音识别模型所使用的参数量多于第一语音识别模型所使用的参数量。
在语音助手处于睡眠态的情况下,电子设备可以实时运行第一语音识别模型。
当唤醒语音助手,在语音助手处于唤醒态的情况下,电子设备可以运行第二语音识别模型。
结合第二方面,在一些实施例中,上述响应于第一语音,提供第一提示的方法具体可以为,响应于第一语音,电子设备可以利用第一语音识别模型确定第一语音与第一指令关联。电子设备可以根据第一语音与第一指令的关联关系,提供第一提示。
其中,电子设备可存储有第一列表。上述第一列表可以是本申请中的执行意图列表。第一列表中可包含一个或多个语音指令对应的意图。第一列表中包含的意图可以称为执行意图。第一列表可包含用户常用语音指令对应的意图。上述常用语音指令可以包括使用频率高、误识率低、没有歧义的语音指令。上述误识率可以指将用户说出的不包含语音指令的语音误识别为语音指令的概率。这样可以方便用户在不进行唤醒操作的情况下,直接下达常用语音指令来控制电子设备执行相应的操作。
第一列表可对应第一句式列表和第一实体列表,第一句式列表包含一个或多个句式,第一实体列表包含一个或多个实体,第一列表中的一个或多个意图由第一句式列表中的句式与第一实体列表中的实体组成。
在一种可能的实现方式中,上述第一指令对应的意图由第一句式和第一实体组成。上述第一语音与第一指令关联可以指:第一语音的句式为上述第一句式,第一语音的实体为上述第一实体。第一句式列表中包含第一句式。第一实体列表中不包含上述第一实体。在语音助手处于睡眠态的情况下,电子设备可以利用第一语音识别模型确定上述第一语音的句式与第一句式列表中的第一句式匹配,且第一实体列表中没有与第一语音的第一实体匹配的实体。然后,电子设备可以提供第一提示。上述第一提示可以为通过语音播报提示用户说出与第一指令匹配的语音。上述与第一指令匹配的语音即为上述第一语音。也即是说,上述第一提示可用于提示用于重复上述第一语音。
进一步的,电子设备还可以将上述第一实体添加至第一实体列表。这样,电子设备在进行语音交互的过程中还可以通过自学习对第一实体列表进行调整,使得第一实体列表中包含更多用户常用的实体,从而使得第一列表中包含的意图更贴近用户的常用语音指令对应,提升用户与电子设备进行语音交互的使用体验。
示例性地,用户说出的第一语音为“播放歌曲2”。第一语音的第一句式为“播放[歌名]”。第一语音的第一实体为“歌曲2”。上述第一句式列表中包含第一句式。上述第一实体列表中不包含第一实体。当接收到第一语音,电子设备可以确定出第一语音命中第一句式列表中的句式(即第一句式),未命中第一实体列表中的实体。电子设备可以提供第一提示,来提示用户重复上述第一语音(也即上述第一指令)。例如,电子设备可以语音播报“我没听清,请再说一遍”。用户可以根据第一提示说出第二语音“播放歌曲2”。第二语音是对上述第一语音的重复。响应于第二语音,电子设备可以播放歌曲2。
由上述实施例可知,即便用户在下达语音指令时,说出的语音与第一列表当前包含的意图不匹配,电子设备仍可以对该语音进行响应,来执行用户下达语音指令对应的操作。上述方法可以更好地为用户提供全时免唤醒的语音交互体验。
电子设备可存储有第二列表。上述第二列表可以是本申请实施例中的扩展意图列表。第二列表中的意图可以称为扩展意图。第二列表中的意图可以与第一列表中的一个或多个意图关联。第二列表可包含用户在表达常用语音指令时所说的非直接、有较高误识率的语音对应的意图。
在另一种可能的实现方式中,在语音助手处于睡眠状态的情况下,电子设备可以利用第一语音识别模型确定第一语音与第二列表中的第一扩展意图匹配。第一扩展意图与第一列表中的第一执行意图关联。该第一执行意图为上述第一指令对应的意图。上述第一语音与第一指令关联可以指:第一语音与第一扩展意图匹配。然后,电子设备可以提供第一提示,来提示用户说出与第一执行意图(即第一指令)对应的语音。
示例性地,用户说出的第一语音为“我好热”。第二列表中包含的第一扩展意图为“我好热”。第一扩展意图与第一列表中的第一执行意图“打开空调”关联。电子设备可以确定第一语音与上述第一扩展意图匹配。电子设备可以提供第一提示,来提示用户说出第一执行意图“打开空调”。例如,电子设备可以在屏幕上显示:可以对我说“打开空调”。若用户想打开空调,则可以根据第一提示说出第二语音“打开空调”。若用户不想打开空调,则可以不理会上述第一提示。若在提供第一提示后,电子设备接收到与上述第一扩展意图匹配的第二语音,电子设备可以响应第二语音,打开空调。
可以看出,上述实施例可以在不唤醒语音助手的情况下,实现既不会漏识别用户可能下达的语音指令,也不会对用户所说的非语音指令的语音误响应,提高用户的语音交互体验。
结合第二方面,在一些实施例中,在识别出第一语音与第二列表中的第一扩展意图匹配的情况下,电子设备可以保持语音助手处于唤醒状态。当接收到上述第二语音,电子设备可以利用第一语音识别模型确定第二语音与第一指令(也即与第一扩展意图关联的第一执行意图)匹配。然后,响应于第二语音,电子设备可以执行第一指令对应的操作。
并且,在确定第二语音与第一指令匹配的情况下,电子设备还可以唤醒语音助手,以便用户进一步向电子设备下达更多的语音指令,从而在不进行唤醒操作的情况下与电子设备进行多轮语音交互。
示例性地,在语音助手处于唤醒态的情况下,电子设备接收第三语音,并利用第二语音识别模型识别第三语音中的第二指令,执行第二指令对应的操作。
结合第二方面,在一些实施例中,在识别出第一语音命中第一句式列表中的句式,未命中第一实体列表中的实体的情况下,电子设备可以唤醒语音助手。在语音助手处于唤醒态的情况下,电子设备接收上述第二语音,并利用第二语音识别模型识别第二语音中的第一指令。由于电子设备唤醒了语音助手,用户在不进行唤醒操作的情况下,可以继续向电子设备下达更多的语音指令,从与电子设备进行多轮语音交互。
结合第二方面,在一些实施例中,唤醒语音助手之后,电子设备还可以在第一时间段内未接收到语音的情况下,将语音助手从唤醒态切换到睡眠态。
其中,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次接收到语音的时刻为起始时刻,时长为预设时长(如5秒、10秒等等)的一段时间。或者,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次从接收到的语音中识别出语音指令的时刻为起始时刻,时长为预设时长的一段时间。或者,上述第一时间段可以是以语音助手处于唤醒态时,电子设备最后一次响应接收到的语音指令执行相应操作的时刻为起始时刻,时长为预设时长的一段时间。
上述实施例可以避免用户在语音助手被唤醒后没有下达语音指令的情况下,电子设备长时间运行高算力语音识别模型功耗过高,从而节约电子设备的功耗。
结合第二方面,在一些实施例中,在语音助手处于睡眠态的情况下,电子接收第四语音。电子设备可以确定第四语音与第三指令匹配,执行第三指令对应的操作。
其中,上述第三指令是与第一列表中的第二执行意图对应的语音指令。电子设备可以利用第一语音识别模型确定第四语音与第三指令(即第二执行意图)匹配。然后,电子设备可以执行第三指令对应的操作。
由上述方法可知,在未唤醒语音助手的情况下,电子设备可以实时检测用户说出的语音是否与第一列表中的意图匹配。当检测到用户说出的语音与第一列表中的意图匹配,电子设备可以直接执行该意图对应的操作。其中,第一列表包含与语音指令对应的意图。也即是说,用户可以直接向电子设备下达与第一列表中的意图对应的语音指令,而无需先唤醒语音助手。这可以提供用户的语音交互体验。
第三方面,本申请提供一种电子设备,该电子设备可包括麦克风、存储器、一个或多个处理器,其中,该麦克风可用于采集语音,该存储器可用于存储计算机程序,该一个或多个处理器可用于调用该计算机程序,使得该电子设备执行如第一方面或第二方面中任一可能的实现方法。
第四方面,本申请提供一种计算机可读存储介质,包括指令,当该指令在电子设备上运行,使得该电子设备执行如第一方面或第二方面中任一可能的实现方法。
第五方面,本申请提供一种计算机程序产品,该计算机程序产品可包含计算机指令,当该计算机指令在电子设备上运行,使得该电子设备执行如第一方面或第二方面中任一可能的实现方法。
第六方面,本申请提供一种芯片,该芯片应用于电子设备,该芯片包括一个或多个处理器,该处理器用于调用计算机指令以使得该电子设备执行如第一方面或第二方面中任一可能的实现方法。
可以理解地,上述第三方面提供的电子设备、第四方面提供的计算机可读存储介质、第五方面提供的计算机程序产品、第六方面提供的芯片均用于执行本申请实施例所提供的方法。因此,其所能达到的有益效果可参考对应方法中的有益效果,此处不再赘述。
附图说明
图1是本申请实施例提供的一种电子设备100的结构示意图;
图2是本申请实施例提供的一种电子设备100的软件结构框图;
图3是本申请实施例提供的一种语音交互系统30的框架图;
图4A~图4C是本申请实施例提供的一些语音交互的场景示意图;
图5A和图5B是本申请实施例提供的另一些语音交互的场景示意图;
图6是本申请实施例提供的一种语音交互方法的流程图;
图7A和图7B是本申请实施例提供的另一些语音交互的场景示意图;
图8是本申请实施例提供的一种调整执行意图列表方法的示意图;
图9是本申请实施例提供的一种调整执行意图列表方法的示意图。
具体实施方式
下面结合本申请实施例中的附图,对本申请实施例中的技术方案进行描述。其中,在本申请实施例的描述中,以下实施例中所使用的术语只是为了描述特定实施例的目的,而并非旨在作为对本申请的限制。如在本申请的说明书和所附权利要求书中所使用的那样,单数表达形式“一种”、“所述”、“上述”、“该”和“这一”旨在也包括例如“一个或多个”这种表达形式,除非其上下文中明确地有相反指示。还应当理解,在本申请以下各实施例中,“至少一个”、“一个或多个”是指一个或两个以上(包含两个)。术语“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系;例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B的情况,其中A、B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。
在本说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。术语“连接”包括直接连接和间接连接,除非另外说明。“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。
在本申请实施例中,“示例性地”或者“例如”等词用于表示作例子、例证或说明。本申请实施例中被描述为“示例性地”或者“例如”的任何实施例或设计方案不应被解释为比其它实施例或设计方案更优选或更具优势。确切而言,使用“示例性地”或者“例如”等词旨在以具体方式呈现相关概念。
本申请以下实施例中的术语“用户界面(user interface,UI)”,是应用程序(application,APP)或操作系统(operating system,OS)与用户之间进行交互和信息交换的介质接口,它实现信息的内部形式与用户可以接受形式之间的转换。用户界面是通过java、可扩展标记语言(extensible markup language,XML)等特定计算机语言编写的源代码,界面源代码在电子设备上经过解析,渲染,最终呈现为用户可以识别的内容。用户界面常用的表现形式是图形用户界面(graphic user interface,GUI),是指采用图形方式显示的与计算机操作相关的用户界面。它可以是在电子设备的显示屏中显示的文本、图标、按钮、菜单、选项卡、文本框、对话框、状态栏、导航栏、Widget等可视的界面元素。
在一些实施例中,电子设备可以实现“一次唤醒,连续对话”的语音交互方案。具体的,电子设备可以实时检测采集到的声音中是否包含唤醒词。当检测到唤醒词,电子设备可以唤醒语音助手,通过语音助 手对在唤醒词之后采集到的语音进行意图识别和动作执行。例如,用户说出唤醒词“小艺小艺”之后,进一步说出了语音指令“播放音乐”。电子设备检测到唤醒词后,可以唤醒语音助手来识别上述语音指令。当识别出上述语音指令对应的意图为播放音乐,电子设备可以播放音乐。
上述语音助手即为语音交互应用。上述语音助手还可以称为语音识别应用等名称。本申请实施例对此不作限定。
上述语音指令可以指用于控制电子设备执行一项或多项操作的语音。
其中,在唤醒语音助手之后,电子设备可以通过语音助手持续检测环境中的人声,并进行意图识别和动作执行。当在预设时间段内未检测到人声,电子设备可以将语音助手退出唤醒状态。语音助手退出唤醒状态后,需要再次响应唤醒操作而唤醒。上述唤醒语音助手的唤醒操作可以包括通过唤醒词唤醒、通过电子设备上的实体按键或虚拟按键唤醒等。本申请实施例对上述用于唤醒语音助手的唤醒操作不作限定。
也即是说,用户通过唤醒词唤醒语音助手后,可以连续向电子设备下达多条语音指令。电子设备可以识别这多条语音指令,并执行这多条语音指令对应的操作。在上述用户连续下达多条语音指令的期间,用户无需在下达每一条语音指令之前先说唤醒词。在用户在预设时间段内不再发声后,用户再次想要通过语音指令控制电子设备,则需要再次说出唤醒词唤醒语音助手。
可以看出,用户在唤醒语音助手后,可以与电子设备连续对话,实现与电子设备之间的多轮语音交互。这可以提高用户与电子设备进行语音交互的流畅性。但是在连续对话超时后,用户仍然需要先唤醒语音助手,再与电子设备进行语音交互。用户仍然无法在不说出唤醒词或者进行其它唤醒操作的情况下,随时通过语音控制电子设备。用户使用语音交互功能的体验较差。
在另一些实施例中,电子设备中可存储一个或多个固定命令词,例如,暂停播放、继续播放、上一首、下一首、上一集、下一集等。当检测到与上述固定命令词匹配的语音,电子设备可以执行上述固定命令词对应的操作。例如,电子设备正在播放音乐。当检测到语音“暂停播放”,电子设备可以确定该语音与固定命令词“暂停播放”匹配。那么,电子设备可以暂停播放当前正在播放的音乐。这样,用户不进行唤醒操作即可下达语音指令控制电子设备。
但在上述实施例中,电子设备存储的固定命令词通常是有限的。上述固定命令词通常是在指定场景中才能使用,例如视频播放场景、音乐播放场景。当用户下达的语音指令不与固定命令词匹配,电子设备将无法在语音助手未被唤醒时响应用户的语音指令。也即是说,用户下达固定命令词涵盖范围之外的语音指令时,仍然需要先唤醒语音助手。
本申请提供一种语音交互方法,实施该方法,用户可以无需唤醒语音助手,随时下达语音指令指示电子设备执行相应的操作。其中,电子设备中可存储有执行意图列表。该执行意图列表可包含用户常用语音指令对应的意图。电子设备可以在语音助手未被唤醒时运行低算力语音识别模型,来检测接收到的语音是否与执行意图列表中的意图匹配。当检测到接收到的语音与执行意图列表中的意图匹配,电子设备可以执行该语音匹配的意图对应的操作。另外,电子设备还可以唤醒语音助手,运行高算力语音识别模型,以便响应用户后续下达的语音指令。当检测到接收到的语音与执行意图列表中的意图不匹配,电子设备可以继续运行低算力语音识别模型,而不唤醒语音助手。
由上述方法可知,在未唤醒语音助手的情况下,电子设备可以实时运行低算力语音识别模型来检测用户是否说出常用语音指令。当检测到用户说出常用语音指令,电子设备可以直接执行该常用语音指令对应的操作。也即是说,用户可以直接向电子设备下达一些常用语音指令,而无需先唤醒语音助手。且在下达常用语音指令后,用户还可以进一步向电子设备下达更多的语音指令,在不进行唤醒操作的情况下与电子设备进行多轮语音交互。
在一些实施例中,电子设备中还可存储有扩展意图列表。该扩展意图列表可包含用户在表达常用语音指令时所说的非直接、有较高误识率的语音对应的意图。其中,扩展意图列表中的任意一个意图可以与执行意图列表中的一个或多个意图关联。扩展意图列表中的意图可以称为扩展意图。执行意图列表中的意图可以称为执行意图。例如,扩展意图列表中包含扩展意图:“我好热”。执行意图列表中包含执行意图:“打开空调”。扩展意图“我好热”可以与执行意图“打开空调”关联。当检测到接收到的语音与执行意图不匹配,电子设备可以检测接收到的语音是否与扩展意图匹配。当检测到接收到的语音与扩展意图匹配,电子设备可以提示用户说出与上述匹配的扩展意图关联的执行意图,以确认用户是否下达语音指令。进一步的,当接收到与上述执行意图匹配的语音,电子设备可以进行该执行意图对应的操作,并唤醒语音助手。
可以看出,在不唤醒语音助手的情况下,除了响应用户下达的直接且毫无疑义的常用语音指令,电子 设备还可以利用扩展意图列表分析用户所说的可能存在疑义的语音,判断用户是否想要下达语音指令。在确认用户想要下达语音指令后,电子设备可以执行用户想要下达的语音指令对应的操作。上述实施例可以提高在未唤醒语音助手的情况下识别用户语音指令的识别率,从而提高全时免唤醒场景下用户通过语音控制电子设备的使用体验。上述全时免唤醒即为用户在任意时刻下达语音指令都无需先进行唤醒语音助手的唤醒操作。
其中,上述低算力语音识别模型的算力等级低,功耗也低。电子设备在未唤醒语音助手的情况下实时运行低算力语音识别模型通常不会产生过高的功耗,从而也不会导致电子设备发热、运行卡顿等问题。本申请提供的语音交互方法可以在实现全时免唤醒的基础上,节约电子设备的功耗。
为了便于理解,下面对本申请涉及的一些概念进行介绍。
1、低算力语音识别模型和高算力语音识别模型
低算力语音识别模型和高算力语音识别模型均可用于进行语音识别,以便于电子设备在识别出语音指令后执行语音指令对应的操作,从而完成与用户的语音交互。
上述低算力语音识别模型的算力低于高算力语音识别模型的算力。上述算力可以指语音识别模型对数据进行处理、运算的能力。电子设备利用低算力语音识别模型进行语音识别的复杂度要低于利用高算力语音识别模型进行语音识别的复杂度。由于算力较低,电子设备利用低算力语音识别模型进行语音识别的识别率相较于利用高算力语音识别模型进行语音识别的识别率更低。并且,在同样的条件下,电子设备运行低算力语音识别模型产生的功耗要少于运行高算力语音识别模型产生的功耗。即低算力语音识别模型为低功耗语音识别模型。高算力语音识别模型为高功耗语音识别模型。低算力语音识别模型的大小通常小于高算力语音识别模型。即高算力语音识别模型在电子设备中需要占据更多的存储空间。
可以理解的,上述低算力和高算力表示的是相对的概念,不对本申请中语音识别模型的计算能力的大小构成具体限定。在一些实施例中,根据算力大小的不同,语音识别模型还可划分为更多算力等级的语音识别模型。通常的,语音识别模型的算力越高,语音识别模型的功耗也越高。
在一种可能的实现方式中,上述低算力语音识别模型和高算力语音识别模型均为基于神经网络的模型。神经网络可以包括输入层、隐藏层和输出层,且各层具有一个或多个节点。相比于高算力语音识别模型,低算力语音识别模型的隐藏层的层数和/或隐藏层的节点数更少。
在本申请中,电子设备可以通过低算力语音识别模型来检测用户的语音是否与预设的意图匹配,并在意图匹配的情况下执行该匹配的意图对应的操作。电子设备可以通过高算力语音识别模型来识别用户语音中的意图,判断用户是否下达语音指令,从而实现语音交互。
在一些实施例中,上述低算力语音识别模型可以部署在端侧,即电子设备上。上述高算力语音识别模型可以部署在端侧,还可以部署在云侧,即云服务器上。例如,语音交互的所有过程均可在电子设备上完成。当电子设备唤醒语音助手后,电子设备可以利用本地的高算力语音识别模型进行语音识别,来进行语音交互。再例如,语音交互可以通过端云结合的方案来完成。在未唤醒语音助手的情况下,电子设备可以利用本地的低算力语音识别模型进行语音识别,来进行语音交互。当电子设备唤醒语音助手后,电子设备可以与云服务器通信,利用云服务器上的高算力语音识别模型进行语音识别,来进行语音交互。
2、执行意图列表
执行意图列表可包含用户常用语音指令对应的意图。其中,意图可以表示用户想要做的事情。根据用户所说的一段语音来识别这一段语音对应的意图可以表示,识别用户说这一段语音想要做什么。例如,用户说出“打开空调”,用户的意图即为希望电子设备能够打开空调。电子设备识别出用户说出“打开空调”的意图后,可以打开空调。
电子设备可以根据执行意图列表,在未唤醒语音助手时快速响应用户说出的常用语音指令。上述常用语音指令可以包括使用频率高、误识率低、没有歧义的语音指令。上述误识率可以指将用户说出的不包含语音指令的语音误识别为语音指令的概率。
例如,在电子设备为车载电脑的场景中,用户可能经常会通过语音指令指示车载电脑开启/关闭车窗、开启/关闭空调、播放音乐、音量调节、导航等等。那么,常用语音指令可包括打开车窗、关闭车窗、打开空调、关闭空调、播放歌曲1、播放歌手1的歌、系统音量调大、导航去地点1等等。包含上述常用语音指令对应的意图的执行意图列表可以参考下述表1:

表1
上述表1仅为本申请实施例对执行意图列表的示例性说明,不应对执行意图列表构成限定。执行意图列表中还可以包含更多或更少的意图。在一些实施例中,上述执行意图列表中的意图还可以根据应用场景进行分类。例如,意图按应用场景可分类为车控类、设置类、音乐类、导航类等等。上述表1中的“打开车窗”、“关闭车窗”、“打开空调”、“关闭空调”可属于车控类的意图。“系统音量调大”可属于设置类的意图。“播放歌曲1”、“播放歌手1的歌”可属于音乐类的意图。“导航去地点1”可属于导航类的意图。本申请实施例对上述意图按应用场景划分的类别不作限定。在一种可能的实现方式中,在进行意图识别时,电子设备可以利用语音识别模型(如低算力语音识别模型、高算力语音识别模型)先识别接收到的语音对应哪个类别的意图,然后根据语音中的关键词确定该语音表达的含义。
可以看出,上述常用语音指令(即与执行意图匹配的语音指令)通常是毫无疑义的,能够明确指示电子设备执行某一项操作。上述常用语音指令也可以称为高频语音指令。
在一些实施例中,上述执行意图列表中的意图可以根据意图中有无实体划分为有实体意图和无实体意图。
实体可以指一种事物类别下的具体实例。例如,实体对应的事物类别可以包括以下一项或多项:歌名、歌手名、地点名、电影名、电视剧名、图书名、火车车次、航班号、电话号码、邮箱等等。示例性地,歌名的具体实例可以包括歌曲1、歌曲2、歌曲3等等。歌手名的具体实例可以包括歌手1、歌手2、歌手3等等。一种事物类别下的具体实例包含的范围较大,可能包含几个到几百万个实体不等。本申请实施例对上述实体对应的事物类别不作限定。上述实体对应的事物类别也可称为实体类别。
有实体意图即为包含实体的意图。其中,有实体意图可以由句式和实体组成。句式可以包含句式主体结构和实体占位符。实体占位符用于确定句式中用于放置实体的位置。有实体意图的句式可以支持在实体占位符所在的位置放置同一种事物类别下的任意实体。
示例性地,“播放歌手1的歌”为有实体意图。该有实体意图的句式可以为“播放[歌手名]的歌”。其中,“播放…的歌”为该句式的主体结构。“[歌手名]”为该句式的实体占位符。该实体占位符位于该句式的主体结构中“播放”与“的歌”之间。该有实体意图的句式可以支持在实体占位符所在的位置放置歌手名这一事物类别下的任意实体。例如,该实体占位符所在的位置上放置有“歌手2”,则有实体意图为“播放歌手2的歌”。该实体占位符所在的位置上放置有“歌手3”,则有实体意图为“播放歌手3的歌”。
由于有实体意图可以由句式和实体组成,上述执行意图列表可对应有句式列表和实体列表。其中,句式列表可以包括执行意图列表中有实体意图的句式。实体列表可以包括执行意图列表中有实体意图的实体。实体列表中的实体可以按照实体对应的事物类别进行分类。例如歌名类的实体、歌手名类的实体、地点名类的实体等等。
可以看出,上述表1中“播放歌曲1”、“播放歌手1的歌”、“导航去地点1”均为有实体意图。表1对应的句式列表可以参考下述表2:
表2
表1对应的实体列表可以参考下述表3:
表3
无实体意图即为不包含实体的意图。其中,执行意图列表中有实体意图之外的意图均为无实体意图。可以看出,上述表1中“打开车窗”、“关闭车窗”、“打开空调”、“关闭空调”、“系统音量调大”均为无实体意图。
在一些实施例中,上述执行意图列表可以是预设的。例如,电子设备在安装语音助手时,除了获取并存储低算力语音识别模型,还可以获取并存储预设的执行意图列表。可选的,电子设备在进行语音交互的过程中还可以通过自学习对执行意图列表进行调整,使得执行意图列表中包含的意图更贴近用户的常用语音指令,从而提升用户与电子设备进行语音交互的使用体验。上述电子设备自学习对执行意图列表进行调整的实现过程将在后续实施例中介绍,这里先不展开。可选的,电子设备还可以接收用户对执行意图列表进行调整的操作,来调整执行意图列表。
上述执行意图列表也可以称为第一列表。本申请实施例对执行意图列表的名称不作限定。
3、扩展意图列表
扩展意图列表可包含用户在表达常用语音指令时所说的非直接、有较高误识率的语音对应的意图。扩展意图列表中的任意一个扩展意图可以与执行意图列表中的一个或多个执行意图关联。
可以理解的,上述执行意图列表中的执行意图匹配的语音都是直接且毫无疑义的。电子设备接收到上述执行意图匹配的语音后,可以很明确地确定用户想要做的事情。在实际的语音交互场景中,用户在向电子设备下达语音指令时还可能说出存在疑义的语音。上述存在疑义的语音可以指电子设备接收到该语音后,不能确定用户是在下达语音指令,还是在非下达语音指令的场景中(如与其他人聊天的场景中)说出该语音。也即是说,上述存在疑义的语音有较高的误识率。若电子设备直接将上述存在疑义的语音确定为用户在下达语音指令,并执行相应的操作,有可能导致用户没有下达语音指令,而电子设备频繁响应来与用户进行语音交互,用户体验较差。然而,若电子设备直接将上述存在疑义的语音确定为用户没有下达语音指令,不进行响应,有可能导致用户实际上在下达语音指令,而电子设备迟迟没有反应,用户体验也较差。
示例性地,用户说出语音“我好热”。一种情况是,用户在下达语音指令。用户说出“我好热”,希望电子设备能打开空调。另一种情况是,用户没有下达语音指令。用户在与其它人聊天的过程中说出了“我好热”。电子设备识别出接收到的语音为“我好热”,可以进一步向用户确认用户是否想要下达语音指令。
电子设备可以根据扩展意图列表,检测接收到的语音是否与扩展意图列表中的扩展意图匹配。用户说出的语音与扩展意图匹配可以表示用户说出的语音存在疑义。当检测到接收到的语音与扩展意图匹配,电子设备可以根据扩展意图关联的执行意图,向用户确认用户是否想要实现与该扩展意图关联的执行意图。在确定用户想要实现与该扩展意图关联的执行意图后,电子设备可以进行该执行意图对应的操作,与用户进行语音交互。
例如,在电子设备为车载电脑的场景中,用户在车上可能会说出我好热、音量太小了等等存在疑义的语音。包含上述存在疑义的语音对应的意图的扩展意图列表可以参考下述表4:
表4
上述表4仅为本申请实施例对扩展意图列表的示例性说明,不应对扩展意图列表构成限定。扩展意图列表中还可以包含更多或更少的意图。
扩展意图列表中的任意一个扩展意图可以与执行意图列表中的一个或多个执行意图关联。例如,执行意图列表中包含执行意图“打开空调”。表4中的扩展意图“我好热”可以与执行意图“打开空调”关联。当电子设备识别出接收到的语音为“我好热”,电子设备可以提示用户说出“我好热”关联的执行意图“打开空调”。然后,当电子设备识别出接收到的语音为“打开空调”,电子设备可以进行执行意图“打开空调”对应的操作,即打开空调。再例如,执行意图列表中包含执行意图“系统音量调大”和执行意图“导航音量调大”。表4中的扩展意图“音量太小了”可以与执行意图“系统音量调大”关联,且与执行意图“导航音量调大”关联。当电子设备识别出接收到的语音为“音量太小了”。电子设备可以提示用户说出“音量太小了”关联的一个执行意图,即“系统音量调大”或“导航音量调大”。然后,当电子设备识别出接收到的语音为“导航音量调大”,电子设备可以进行执行意图“导航音量调大”对应的操作,即调大导航音量。
在一些实施例中,上述扩展意图列表可以是预设的。例如,电子设备在安装语音助手时,可以获取并存储预设的扩展意图列表。可选的,电子设备在进行语音交互的过程中还可以通过自学习对扩展意图列表 进行调整。例如,将扩展意图列表中的扩展意图添加至执行意图列表,从而将扩展意图调整为执行意图。可选的,电子设备还可接收用户对执行意图列表进行调整的操作,来调整执行意图列表。
上述扩展意图列表也可以称为第二列表。本申请实施例对扩展意图列表的名称不作限定。
4、语音助手的睡眠态和唤醒态
The voice assistant may be an application in the electronic device for implementing voice interaction. The voice assistant may be preinstalled in the electronic device when the electronic device leaves the factory. Alternatively, the voice assistant may be installed by the electronic device in response to a user operation of installing the voice assistant, or during a system update of the electronic device. This embodiment of this application does not limit the implementation by which the electronic device installs the voice assistant.
When installing the voice assistant, the electronic device may obtain and store the low-compute speech recognition model, the execution intent list, and the extended intent list. Optionally, when installing the voice assistant, the electronic device may also obtain and store the high-compute speech recognition model.
In some embodiments, the states of the voice assistant may include a sleep state and an awake state. When the voice assistant has not been woken up, the voice assistant may be in the sleep state. When the voice assistant is in the sleep state, the electronic device may run the low-compute speech recognition model to recognize whether a received voice matches an execution intent. When recognizing that the received voice matches an execution intent, the electronic device may perform the operation corresponding to that execution intent and wake up the voice assistant. When using the low-compute speech recognition model to recognize the intent in the user's voice, the electronic device 100 may make use of the execution intent list and the extended intent list described above.
When waking up the voice assistant, the electronic device may switch the voice assistant from the sleep state to the awake state. When the voice assistant is in the awake state, the electronic device may run the high-compute speech recognition model to recognize the intent corresponding to a received voice, thereby interacting with the user by voice. When using the high-compute speech recognition model to recognize the intent in the user's voice, the electronic device 100 may do so without relying on the execution intent list and the extended intent list.
When no human voice or voice command is detected within a preset time period, the electronic device may switch the voice assistant from the awake state to the sleep state.
It can be seen that the electronic device may run the low-compute speech recognition model while the voice assistant is in the sleep state, and run the high-compute speech recognition model while the voice assistant is in the awake state. Since the power consumption of the low-compute speech recognition model is low, continuously running it while the voice assistant has not been woken up allows the electronic device to provide the user with an always-on, wake-free experience at the lowest possible power consumption.
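The state switching described above amounts to a small state machine: the low-compute model runs in the sleep state, the high-compute model in the awake state, and an idle timeout drops the assistant back to sleep. A minimal sketch, with the timeout value assumed:

```python
import time

class AssistantState:
    """Sketch of the sleep/awake switching; the 60 s idle timeout is an assumption."""
    SLEEP, AWAKE = "sleep", "awake"

    def __init__(self, idle_timeout_s: float = 60.0):
        self.state = self.SLEEP
        self.idle_timeout_s = idle_timeout_s
        self.last_activity = time.monotonic()

    def on_execution_intent_matched(self) -> None:
        # The low-compute model matched an execution intent: wake up.
        self.state = self.AWAKE
        self.last_activity = time.monotonic()

    def on_voice_detected(self) -> None:
        self.last_activity = time.monotonic()

    def tick(self) -> None:
        # No human voice or command within the preset period: back to sleep.
        if (self.state == self.AWAKE
                and time.monotonic() - self.last_activity > self.idle_timeout_s):
            self.state = self.SLEEP

    def active_model(self) -> str:
        # Low-compute ASR runs only in sleep; high-compute ASR only when awake.
        return "low-compute model" if self.state == self.SLEEP else "high-compute model"
```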
In some embodiments, the sleep state above may also be called the first state, and the awake state above may also be called the second state. This embodiment of this application does not limit the names of the sleep state and the awake state.
The electronic device involved in this application is introduced below.
FIG. 1 exemplarily shows a schematic structural diagram of the electronic device 100.
As shown in FIG. 1, the electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and so on.
It can be understood that the structure illustrated in this embodiment of this application does not constitute a specific limitation on the electronic device 100. In other embodiments of this application, the electronic device 100 may include more or fewer components than shown, or combine some components, or split some components, or have a different component arrangement. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), and so on. The different processing units may be independent devices or may be integrated into one or more processors.
The controller may be the nerve center and command center of the electronic device 100. The controller may generate operation control signals according to instruction operation codes and timing signals, and complete the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some examples, the memory in the processor 110 is a cache. The memory may hold instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from the memory. This avoids repeated accesses and reduces the waiting time of the processor 110, thereby improving the efficiency of the system.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, and so on. The USB interface 130 may be used to connect a charger to charge the electronic device 100, or to transfer data between the electronic device 100 and peripheral devices. It may also be used to connect a headset and play audio through the headset.
The charging management module 140 is used to receive charging input from a charger. The charger may be a wireless charger or a wired charger. While charging the battery 142, the charging management module 140 may also supply power to the electronic device through the power management module 141.
The power management module 141 is used to connect the battery 142, the charging management module 140, and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, the wireless communication module 160, and so on.
The wireless communication functions of the electronic device 100 may be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device 100 may be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas may also be multiplexed to improve antenna utilization. For example, the antenna 1 may be multiplexed as a diversity antenna of the wireless local area network. In some other embodiments, an antenna may be used in combination with a tuning switch.
The mobile communication module 150 may provide wireless communication solutions including 2G/3G/4G/5G applied on the electronic device 100. The mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and so on. The mobile communication module 150 may receive electromagnetic waves via the antenna 1, perform filtering, amplification, and other processing on the received electromagnetic waves, and pass them to the modem processor for demodulation. The mobile communication module 150 may also amplify signals modulated by the modem processor and convert them into electromagnetic waves for radiation via the antenna 1.
The wireless communication module 160 may provide wireless communication solutions applied on the electronic device 100 including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), global navigation satellite systems (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and so on. The wireless communication module 160 may be one or more devices integrating at least one communication processing module. The wireless communication module 160 receives electromagnetic waves via the antenna 2, performs frequency modulation and filtering on the electromagnetic wave signals, and sends the processed signals to the processor 110. The wireless communication module 160 may also receive signals to be sent from the processor 110, perform frequency modulation and amplification on them, and convert them into electromagnetic waves for radiation via the antenna 2.
The electronic device 100 implements display functions through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing, connected to the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering.
The display screen 194 is used to display images, videos, and so on. In some embodiments, the electronic device 100 may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device 100 may implement shooting functions through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on.
The ISP is used to process data fed back by the camera 193. For example, when taking a photo, the shutter is opened, light is transmitted through the lens to the camera's photosensitive element, the light signal is converted into an electrical signal, and the photosensitive element passes the electrical signal to the ISP for processing, converting it into an image visible to the naked eye.
The camera 193 is used to capture still images or videos. In some embodiments, the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.
The digital signal processor is used to process digital signals; in addition to digital image signals, it can also process other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
The NPU is a neural-network (NN) computing processor. By drawing on the structure of biological neural networks, for example the transmission mode between neurons in the human brain, it processes input information rapidly and can also continuously self-learn. Applications such as intelligent cognition of the electronic device 100 can be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, and so on.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement data storage functions, for example saving files such as music and videos on the external memory card.
The internal memory 121 may be used to store computer-executable program code, where the executable program code includes instructions. By running the instructions stored in the internal memory 121, the processor 110 executes the various functional applications and data processing of the electronic device 100. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system, applications required by at least one function (such as a sound playing function, an image playing function, and so on), and the like. The data storage area may store data created during the use of the electronic device 100 (such as audio data, a phone book, and so on). In addition, the internal memory 121 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, universal flash storage (UFS), and so on.
The electronic device 100 may implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and so on.
The audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some examples, the audio module 170 may be provided in the processor 110, or some functional modules of the audio module 170 may be provided in the processor 110. The speaker 170A, also called the "horn", is used to convert audio electrical signals into sound signals. The receiver 170B, also called the "earpiece", is used to convert audio electrical signals into sound signals. The microphone 170C, also called the "mic" or "mouthpiece", is used to convert sound signals into electrical signals. The headset jack 170D is used to connect a wired headset.
The sensor module 180 may include a pressure sensor, a gyroscope sensor, a barometric pressure sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a proximity light sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and so on.
The button 190 includes a power button, volume buttons, and so on. The motor 191 can generate vibration prompts. The indicator 192 may be an indicator light, and may be used to indicate charging status and battery level changes, or to indicate messages, missed calls, notifications, and so on.
The electronic device 100 may be an electronic device running HarmonyOS or another operating system, for example, an in-vehicle computer, a mobile phone, a tablet computer, a laptop computer, a smart watch, a smart band, and so on. This embodiment of this application does not limit the specific type of the electronic device 100.
The software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservices architecture, or a cloud architecture. This embodiment of this application takes a layered-architecture system as an example to exemplarily describe the software structure of the electronic device 100.
FIG. 2 is a block diagram of the software structure of the electronic device 100 according to this embodiment of this application.
The layered architecture divides the software into several layers, each with a clear role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the system is divided into four layers: from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages.
As shown in FIG. 2, the application packages may include applications such as Camera, Gallery, Calendar, Phone, Maps, Navigation, WLAN, Bluetooth, Music, Messages, and the voice assistant. For the voice assistant, refer to the descriptions in the foregoing embodiments.
The application framework layer provides APIs and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 2, the application framework layer may include a window manager, content providers, a view system, a telephony manager, a resource manager, a notification manager, an activity manager, and so on.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
Content providers are used to store and retrieve data and make this data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying images. The view system can be used to build applications. A display interface may consist of one or more views. For example, a display interface including an SMS notification icon may include a view displaying text and a view displaying an image.
The telephony manager is used to provide the communication functions of the electronic device 100, for example, management of call states (including connected, hung up, and so on).
The resource manager provides applications with various resources, such as localized strings, icons, images, layout files, video files, and so on.
The notification manager enables applications to display notification information in the status bar (such as the pull-down notification panel). It can be used to convey informational messages that may disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify of download completion, message reminders, and so on. The notification manager may also present notifications that appear in the system top status bar in the form of charts or scrolling text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of dialog windows. For example, text information is prompted in the status bar, a prompt tone is emitted, the electronic device vibrates, an indicator light flashes, and so on.
The activity manager is responsible for managing activities, and for the startup, switching, and scheduling of the components in the system, as well as the management and scheduling of applications. The activity manager can be called by upper-layer applications to open a corresponding activity.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for the scheduling and management of the Android system.
The core libraries consist of two parts: one part is the functional functions that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system libraries may include multiple functional modules, for example: a surface manager, media libraries, a 3D graphics processing library (such as OpenGL ES), a 2D graphics engine (such as SGL), and so on.
The surface manager is used to manage the display subsystem and provides the blending of 2D and 3D layers for multiple applications.
The media libraries support playback and recording of multiple commonly used audio and video formats, as well as static image files and so on. The media libraries may support multiple audio and video encoding formats, such as MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, and so on.
The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, layer processing, and so on.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is the layer between hardware and software. The kernel layer contains at least a display driver, a camera driver, an audio driver, and a sensor driver.
The voice interaction method provided by this application can be applied in a voice interaction system. The voice interaction system involved in this application is introduced below.
FIG. 3 exemplarily shows a framework diagram of a voice interaction system 30 provided by this application.
As shown in FIG. 3, the voice interaction system 30 may include the electronic device 100 and a cloud server 200. A communication connection may be established between the electronic device 100 and the cloud server 200. This embodiment of this application does not limit the communication method between the electronic device 100 and the cloud server 200.
The electronic device 100 may contain the voice assistant. The voice assistant may contain the low-compute speech recognition model and the high-compute speech recognition model. As can be seen from the foregoing embodiments, the voice assistant may also contain the execution intent list (not shown in FIG. 3) and the extended intent list (not shown in FIG. 3).
The cloud server 200 may contain a high-compute speech recognition model.
In some embodiments, the entire voice interaction process may be completed on the electronic device 100. When the voice assistant has not been woken up, the electronic device 100 may run the local low-compute speech recognition model to respond quickly to voice commands issued by the user. When the voice assistant is woken up, the electronic device 100 may use the local high-compute speech recognition model for speech recognition to carry out voice interaction.
In some embodiments, voice interaction may be completed through a combined device-cloud solution. When the electronic device 100 is connected to the network, the electronic device 100 may communicate with the cloud server 200. For example, when the voice assistant has not been woken up, the electronic device 100 may run the local low-compute speech recognition model to respond quickly to voice commands issued by the user. When the voice assistant is woken up, the electronic device 100 may communicate with the cloud server 200 (such as sending the received voice to the cloud server 200 and receiving the speech recognition result from the cloud server 200) and use the high-compute speech recognition model on the cloud server 200 for speech recognition, to carry out voice interaction. For another example, when the voice assistant is woken up, the electronic device 100 may use the local high-compute speech recognition model for speech recognition, and may also communicate with the cloud server 200 to use the high-compute speech recognition model on the cloud server 200 for speech recognition. The electronic device 100 may adopt whichever speech recognition result is obtained fastest. Alternatively, the electronic device 100 may compare the accuracy of speech recognition using the local high-compute speech recognition model with the accuracy of speech recognition using the high-compute speech recognition model on the cloud server 200, and adopt the speech recognition result with the higher accuracy. This embodiment of this application does not limit the above method of implementing voice interaction through a device-cloud combination.
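Adopting whichever recognition result arrives first can be expressed as a race between the local and cloud recognizers. A sketch with stubbed recognizers (the real interfaces to the device model and to cloud server 200 are not specified here):

```python
import concurrent.futures
import time

def recognize_local(audio: bytes) -> str:
    time.sleep(0.05)          # stand-in for the on-device high-compute model
    return "local result"

def recognize_cloud(audio: bytes) -> str:
    time.sleep(0.20)          # stand-in for the model on cloud server 200
    return "cloud result"

def recognize_fastest(audio: bytes) -> str:
    """Run both recognizers and adopt whichever result is obtained first."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    futures = [pool.submit(recognize_local, audio),
               pool.submit(recognize_cloud, audio)]
    done, _ = concurrent.futures.wait(
        futures, return_when=concurrent.futures.FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not block on the slower recognizer
    return next(iter(done)).result()

print(recognize_fastest(b""))  # "local result": the local stub finishes first
```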
In some embodiments, the voice assistant in the electronic device 100 may not contain a high-compute speech recognition model. When the voice assistant is woken up, the electronic device 100 may communicate with the cloud server 200 and use the high-compute speech recognition model on the cloud server 200 for speech recognition, to carry out voice interaction.
The voice interaction system 30 shown in FIG. 3 is merely an exemplary illustration of this embodiment of this application. The voice interaction system 30 may contain more or fewer modules.
As can be seen from the voice interaction system 30 above, when the voice assistant has not been woken up, the electronic device 100 can run the low-compute speech recognition model locally in real time, so as to provide the user with a wake-free experience of issuing voice commands without waking up the voice assistant. Since the power consumption of the low-compute speech recognition model is low, running it for a long time has little impact on the power consumption of the electronic device 100. When the voice assistant is in the awake state, the electronic device 100 can use either the local high-compute speech recognition model or the high-compute speech recognition model in the cloud server 200 for speech recognition, so as to give the user the experience of multi-round voice interaction with the electronic device 100 without performing a wake-up operation.
In subsequent embodiments of this application, the electronic device 100 being an in-vehicle computer is used as an example to introduce the voice interaction method in an in-vehicle scenario. Besides the in-vehicle scenario, the voice interaction method provided by this application is also applicable to other scenarios.
FIG. 4A to FIG. 4C exemplarily show a wake-free voice interaction scenario provided by an embodiment of this application.
As shown in FIG. 4A, in the in-vehicle scenario, the voice assistant in the electronic device 100 may be in the sleep state. The electronic device 100 may display the user interface 410 shown in FIG. 4A. The user interface 410 may be the desktop of the electronic device 100. The user interface 410 may display interface elements such as application icons (such as a navigation application icon, a radio application icon, a music application icon, and so on) and a time widget. This embodiment of this application does not limit the content displayed on the user interface 410.
When the voice assistant is in the sleep state, the electronic device 100 may run the low-compute speech recognition model to recognize whether a detected voice matches an execution intent in the execution intent list.
As shown in FIG. 4B, the user issues the voice command "play Song 1" to the electronic device 100 in the car. The execution intent list stored by the electronic device 100 contains the execution intent "play Song 1". The electronic device 100 may detect the voice "play Song 1" in the environment. The electronic device 100 may use the low-compute speech recognition model to recognize that the voice matches the execution intent "play Song 1". Then, the electronic device 100 may perform the operation corresponding to that execution intent, that is, call the music application to play Song 1. In addition, the electronic device 100 may also wake up the voice assistant, switching it from the sleep state to the awake state.
Exemplarily, when recognizing that the detected voice matches an execution intent, the electronic device 100 may announce by voice "OK, playing Song 1 for you" and start playing Song 1. The electronic device 100 may display the user interface 420 shown in FIG. 4B. The user interface 420 may contain a song playback component 411 and an awake indicator 412. The song playback component 411 may be used to indicate the song the electronic device 100 is currently playing. For example, the song name "Song 1" displayed in the song playback component 411 may indicate that the electronic device 100 is currently playing Song 1. The song playback component 411 may also contain a pause control, a next-song control, and a previous-song control, so that the user can control the music played by the electronic device 100 through the controls in the song playback component 411. The song playback component 411 may also contain lyrics (not shown in the figure). This embodiment of this application does not limit the content contained in the song playback component 411. The awake indicator 412 may be used to indicate that the voice assistant in the electronic device 100 is in the awake state. In other words, when the voice assistant is in the awake state, the electronic device 100 may display the awake indicator 412 on the user interface 420.
When the voice assistant is in the awake state, the electronic device 100 may run the high-compute speech recognition model to recognize the intent corresponding to a received voice, thereby interacting with the user by voice. The electronic device 100 may run one or more high-compute speech recognition models while the voice assistant is in the awake state. This embodiment of this application does not limit the number of high-compute speech recognition models.
In some embodiments, when the voice assistant is in the awake state, the electronic device 100 may stop running the low-compute speech recognition model. That is, the low-compute speech recognition model may run only when the voice assistant is in the sleep state, and the high-compute speech recognition model may run only when the voice assistant is in the awake state.
It can be understood that, after speaking a voice matching an execution intent, the user may well continue to issue voice commands. Therefore, when recognizing that the detected voice matches an execution intent, the electronic device 100 may wake up the voice assistant and better respond to the user's subsequent requests through the high-compute speech recognition model, achieving multi-round voice interaction with the user.
As shown in FIG. 4C, while the voice assistant is in the awake state, the user issues the voice command "close the window" to the electronic device 100 in the car. The electronic device 100 may detect the voice "close the window" in the environment. The electronic device 100 may recognize the intent of the voice through the high-compute speech recognition model and perform the operation corresponding to that intent, that is, call the module controlling the windows to close the window.
Exemplarily, when detecting the voice command "close the window", the electronic device 100 may announce by voice "OK, closing the window for you" and close the window. The electronic device 100 may display the user interface 430 shown in FIG. 4C. The user interface 430 may contain the awake indicator 412 and a voice announcement component 413. The voice announcement component 413 may display the content announced by the electronic device 100 in response to the user's voice command.
When no human voice or voice command is detected within a preset time period, the electronic device 100 may switch the voice assistant from the awake state to the sleep state. This avoids the excessive power consumption of running the high-compute speech recognition model for a long time while the user is not issuing voice commands.
As can be seen from the scenarios shown in FIG. 4A to FIG. 4C, when issuing some commonly used voice commands to the electronic device 100, the user can issue the voice commands directly without performing a wake-up operation to wake up the voice assistant. This helps the user quickly control the electronic device by voice in some common scenarios (such as controlling hardware devices in the car, listening to music, navigating, and so on). Moreover, the user can issue multiple voice commands to the electronic device 100 in succession and carry out multi-round voice interaction with it. Throughout these rounds of voice interaction, the user never needs to perform a wake-up operation. The above embodiments can improve the fluency of voice interaction between the user and the electronic device.
FIG. 5A and FIG. 5B exemplarily show another wake-free voice interaction scenario provided by an embodiment of this application.
As shown in FIG. 5A, the voice assistant in the electronic device 100 is currently in the sleep state. The user speaks the voice "I am so hot". The extended intent list stored by the electronic device 100 contains the extended intent "I am so hot", and the extended intent "I am so hot" is associated with the execution intent "turn on the air conditioner" in the execution intent list.
The electronic device 100 may detect the voice "I am so hot" in the environment. The electronic device 100 may use the low-compute speech recognition model to recognize that no execution intent in the execution intent list matches the voice. Then, the electronic device 100 may determine whether the voice matches an extended intent in the extended intent list. Since the extended intent list contains the intent "I am so hot", the electronic device 100 can recognize that the detected voice matches an extended intent. The electronic device 100 may confirm with the user whether a voice command is being issued. Specifically, based on the execution intent associated with the extended intent "I am so hot", the electronic device 100 may display the user interface 510 shown in FIG. 5A. The user interface 510 may include a prompt box 421. The prompt content in the prompt box 421 may be used to guide the user to speak a voice matching the execution intent associated with the extended intent "I am so hot". For example, the prompt content in the prompt box 421 may be: You can say "turn on the air conditioner" to me.
As shown in FIG. 5B, the user may speak the voice "turn on the air conditioner" according to the prompt in the prompt box 421 shown in FIG. 5A. The electronic device 100 may detect the voice "turn on the air conditioner" in the environment. The voice assistant in the electronic device 100 is still in the sleep state. The electronic device 100 may use the low-compute speech recognition model to recognize that the voice matches the execution intent "turn on the air conditioner". Then, the electronic device 100 may perform the operation corresponding to that execution intent, that is, call the module controlling the air conditioner in the car to turn on the air conditioner. In addition, when recognizing that the detected voice matches an execution intent, the electronic device 100 may also wake up the voice assistant, switching it from the sleep state to the awake state.
Exemplarily, when recognizing that the detected voice matches an execution intent, the electronic device 100 may announce by voice "OK, turning on the air conditioner for you" and call the module controlling the air conditioner to turn it on. The electronic device 100 may display the user interface 520 shown in FIG. 5B. The user interface 520 may contain the awake indicator 412, which may be used to indicate that the voice assistant is in the awake state.
In some embodiments, one extended intent may be associated with multiple execution intents. For example, the extended intent "the volume is too low" may be associated with the execution intent "turn up the system volume" and also with the execution intent "turn up the navigation volume". When the extended intent matched by the detected voice is associated with multiple execution intents, the electronic device 100 may prompt the user to speak one of these execution intents. For example, when detecting the voice "the volume is too low", the electronic device 100 may display a prompt box on the interface, prompting the user to say "turn up the system volume" or "turn up the navigation volume". This makes it easier for the electronic device 100 to determine whether the user wants to issue a voice command and what the intent corresponding to the desired voice command is.
This embodiment of this application does not limit the way the electronic device 100 confirms with the user whether a voice command is being issued when recognizing that the detected voice matches an extended intent.
In some embodiments, after the electronic device 100 displays the prompt box 421 shown in FIG. 5A, if the electronic device 100 does not detect a voice matching the execution intent associated with "I am so hot" (that is, "turn on the air conditioner"), the electronic device 100 may keep the voice assistant in the sleep state. It can be understood that the user speaking a voice matching an extended intent in the car (such as "I am so hot") does not necessarily mean the user is issuing a voice command. If the user then speaks the corresponding voice according to the prompt of the electronic device 100 (such as "turn on the air conditioner"), it can indicate that the user's voice matching the extended intent was indeed a voice command. If the user does not speak the corresponding voice according to the prompt of the electronic device 100, it can indicate that the user's voice matching the extended intent was not a voice command (for example, the user may have been chatting with other people).
As can be seen from the scenarios shown in FIG. 5A and FIG. 5B, through the extended intent list, the electronic device can also respond to an ambiguous voice spoken by the user by prompting the user to speak a more direct and unambiguous voice command (that is, a voice matching the execution intent associated with the extended intent), thereby determining whether the user is issuing a voice command. When the user's voice matching an extended intent is actually an attempt to issue a voice command, the above embodiments prompt the user to use a more direct phrasing, so that the electronic device can perform the operation corresponding to the voice command the user wants to issue. And when a voice spoken in a non-command scenario such as chatting happens to match the extended intent list, the prompting operation of the electronic device 100 does not unduly disturb the user. The above embodiments can, without waking up the voice assistant, neither miss voice commands the user may issue nor falsely respond to non-command voices spoken by the user, improving the user's voice interaction experience.
The method by which the electronic device 100 uses the low-compute speech recognition model to recognize whether a detected voice matches an execution intent is introduced below.
In some embodiments, the low-compute speech recognition model may include a speech feature extraction model, an acoustic model, and a language model. The electronic device 100 may receive voice input through the microphone. The electronic device 100 may use the speech feature extraction model to extract the speech features of the voice input. Then, the electronic device 100 may use the acoustic model to obtain a phoneme sequence from the speech features, realizing the generation from speech features to characters. It can be understood that the same pronunciation (that is, phoneme) may correspond to multiple different characters. Through the acoustic model, the electronic device 100 can obtain multiple candidate characters pronounced the same as the voice input. Further, based on the output of the acoustic model, the electronic device 100 may use the language model to determine the character sequence with the maximum probability. That is, the electronic device 100 may combine the multiple candidate characters obtained by the acoustic model to obtain the character sequence with the maximum probability of representing the voice input. After obtaining the character sequence, the electronic device 100 may determine whether the execution intent list contains an execution intent matching the character sequence. If yes, the voice detected by the electronic device 100 matches an execution intent; if no, it does not.
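The low-compute recognition pipeline described above — speech features, then candidate characters from the acoustic model, then the maximum-probability character sequence from the language model, then a list lookup — can be sketched as a simple function chain. The three model callables here are stand-ins; no concrete models are implied:

```python
from typing import Callable, Iterable

def recognize_and_match(
    audio: bytes,
    extract_features: Callable[[bytes], list],
    acoustic_model: Callable[[list], list],   # features -> candidate characters
    language_model: Callable[[list], str],    # candidates -> most probable text
    execution_intents: Iterable[str],
) -> tuple[str, bool]:
    """Low-compute pipeline sketch: returns the decoded text and whether it
    matches an execution intent in the execution intent list."""
    features = extract_features(audio)
    candidates = acoustic_model(features)
    text = language_model(candidates)
    return text, text in set(execution_intents)
```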
As can be seen from the foregoing embodiments, execution intents can be divided into entity intents and non-entity intents. An entity intent may consist of a sentence pattern and an entity, and the execution intent list may have a corresponding pattern list and entity list. After using the low-compute speech recognition model to recognize the received voice input and obtain the character sequence with the maximum probability, the electronic device 100 may determine whether the sentence pattern of the character sequence exists in the pattern list. If the sentence pattern of the character sequence exists in the pattern list (that is, the voice spoken by the user hits a pattern), the electronic device 100 may determine whether the entity in the character sequence exists in the entity list. Specifically, when determining that the sentence pattern of the character sequence exists in the pattern list, the electronic device 100 may, based on the entity category corresponding to the entity placeholder in the pattern, look up whether that entity category in the entity list contains the entity in the character sequence. The electronic device 100 thus does not need to compare the character sequence one by one with all the entity intents in the execution intent list, which can simplify the process of recognizing whether a detected voice matches an execution intent.
In some embodiments, when issuing a voice command, the user does not necessarily speak content that is word-for-word identical to an execution intent. For example, the execution intent is "play Song 1". When hoping to control the electronic device 100 by voice to play Song 1, the user may use expressions such as "play Song 1", "help me play Song 1", "please play Song 1 now", and so on. It can be seen that some expressions add a prefix and/or suffix to the execution intent, and the prefix and/or suffix do not affect the meaning expressed. The electronic device 100 may store a prefix/suffix list. For example, the prefix/suffix list may contain prefixes such as "please", "help me", "please help me" and suffixes such as sentence-final particles. This embodiment of this application does not specifically limit the above prefixes and suffixes. As can be seen from the foregoing embodiments, the electronic device 100 may use the low-compute speech recognition model to recognize the detected voice and obtain the character sequence with the maximum probability. The electronic device 100 may determine whether the character sequence contains a prefix or suffix from the prefix/suffix list. If so, the electronic device 100 may remove the prefix and/or suffix from the character sequence before comparing it with the execution intents. The electronic device 100 may determine whether the execution intent list contains an execution intent matching the character sequence with the prefix and/or suffix removed. If yes, the voice detected by the electronic device 100 matches an execution intent; if no, it does not. Through the above embodiments, when issuing a voice command without performing a wake-up operation, the user can use a variety of different phrasings and does not need to match an execution intent word for word. This can further improve the user's experience of using the voice interaction function.
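Prefix/suffix stripping before matching can be sketched as below; the English prefixes and suffixes here are assumed stand-ins for the particles in the stored prefix/suffix list:

```python
PREFIXES = ("please help me ", "help me ", "please ")   # assumed inventory
SUFFIXES = (" now", " then")                            # assumed inventory

def strip_affixes(text: str) -> str:
    """Remove one known prefix and/or suffix before intent matching."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):  # longest match first
        if text.startswith(prefix):
            text = text[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            text = text[:-len(suffix)]
            break
    return text

print(strip_affixes("please help me play Song 1"))  # "play Song 1"
```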
The electronic device 100 may also use the low-compute speech recognition model to recognize whether a detected voice matches an extended intent. For the specific method, refer to the above method of recognizing whether a detected voice matches an execution intent, which is not repeated here.
Based on the above wake-free voice interaction scenarios, a voice interaction method provided by an embodiment of this application is introduced below.
FIG. 6 exemplarily shows a flowchart of a voice interaction method. As shown in FIG. 6, the method may include steps S611 to S624. Specifically:
S611. Detect voice 1.
The electronic device 100 may collect sound from the surrounding environment in real time through the microphone. When the user speaks near the electronic device 100, the electronic device 100 may, through the processor, detect a voice in the collected sound, for example voice 1.
S612. Determine whether voice 1 contains the wake word.
When detecting voice 1, the electronic device 100 may determine whether voice 1 contains the wake word. The electronic device 100 may use a wake-up speech recognition model to determine whether voice 1 contains the wake word. The wake-up speech recognition model and the low-compute speech recognition model in this application may be the same model or different models. This embodiment of this application does not limit the specific method of determining whether voice 1 contains the wake word.
When determining that voice 1 contains the wake word, the electronic device 100 may perform step S613 below. When determining that voice 1 does not contain the wake word, the electronic device 100 may perform step S616 below.
Case 1: (S613-S614) Voice 1 is the wake word, and the electronic device 100 wakes up the voice assistant.
S613. Wake up the voice assistant and run the high-compute speech recognition model.
When detecting the wake word, the electronic device 100 may wake up the voice assistant. When the voice assistant is in the awake state, the electronic device 100 may run the high-compute speech recognition model.
S614. Recognize the user's intent in the detected voice through the high-compute speech recognition model, and perform the operation corresponding to that intent.
While the voice assistant is in the awake state, the electronic device 100 may collect sound from the surrounding environment through the microphone, detect the voice contained in the collected sound through the processor, recognize the user's intent in the detected voice through the high-compute speech recognition model, and perform the operation corresponding to that intent.
Exemplarily, the above voice 1 is the voice of the user speaking the wake word (such as "Xiaoyi Xiaoyi"). After recognizing that voice 1 contains the wake word, the electronic device 100 may wake up the voice assistant. After detecting voice 1, the electronic device 100 also detects the voice of the user issuing the voice command "open the window". The electronic device 100 may run the high-compute speech recognition model to recognize the intent corresponding to the detected voice command (that is, open the window). Then, the electronic device 100 may call the module controlling the windows to open the window.
Not limited to waking up the voice assistant through the above wake word, the electronic device 100 may also wake up the voice assistant in response to other wake-up operations (such as an operation acting on a physical button or a virtual button).
That is to say, in the voice interaction method provided by this application, the user can still wake up the voice assistant first through a wake-up operation and then issue a voice command to the electronic device 100. After the voice assistant is woken up, the electronic device 100 can recognize the voice command through the voice assistant and perform the operation corresponding to the voice command.
S615. If there is no voice interaction within a preset time period, switch the voice assistant to the sleep state.
After waking up the voice assistant, the electronic device 100 may detect in real time whether the user is interacting with it by voice. If voice interaction between the user and the electronic device 100 is detected, the electronic device 100 may keep the voice assistant in the awake state, recognize voice commands, and perform the operations corresponding to them. If no voice interaction between the user and the electronic device 100 is detected within a preset time period, the electronic device 100 may switch the voice assistant to the sleep state, thereby saving the power consumption of the electronic device 100. The preset time period may be, for example, 1 minute, 2 minutes, and so on. This embodiment of this application does not limit the value of the preset time period.
In a possible implementation, the electronic device 100 may determine whether the user is interacting with it by voice by detecting whether there is a human voice in the surrounding environment; the electronic device 100 detecting a human voice in the surrounding environment may indicate that voice interaction is taking place. In another possible implementation, the electronic device 100 may determine whether the user is interacting with it by voice by recognizing whether the detected voice contains a voice command for controlling the electronic device 100; the electronic device 100 recognizing that the detected voice contains a voice command may indicate that voice interaction is taking place.
It can be seen that, after the voice assistant is woken up, if the user continues to interact with the electronic device 100 by voice, the electronic device 100 may keep the voice assistant in the awake state. This way, the user does not need to perform wake-up operations frequently when carrying out multi-round voice interaction with the electronic device 100.
S616. Determine whether the voice assistant is in the awake state.
When the wake word is not detected, the electronic device 100 may determine whether the voice assistant is currently in the awake state.
If the voice assistant is already in the awake state when the electronic device 100 detects voice 1, the electronic device 100 may perform step S614 above. Specifically, the electronic device 100 may recognize the user's intent in voice 1 through the high-compute speech recognition model and perform the operation corresponding to that intent.
If the voice assistant is not in the awake state when the electronic device 100 detects voice 1, the electronic device 100 may perform step S617 below.
Case 2: (S617-S619) Voice 1 is a voice matching an execution intent.
S617. The voice assistant is in the sleep state; run the low-compute speech recognition model.
The voice assistant not being in the awake state means the voice assistant is in the sleep state.
S618. Determine through the low-compute speech recognition model whether voice 1 matches an execution intent in the execution intent list.
For the implementation of the electronic device 100 determining whether voice 1 matches an execution intent, refer to the descriptions in the foregoing embodiments.
If determining that voice 1 matches an execution intent in the execution intent list, the electronic device 100 may perform step S619 below. Otherwise, the electronic device 100 may perform step S620 below.
S619. Perform the operation corresponding to the execution intent matched by voice 1, and wake up the voice assistant.
Exemplarily, the above voice 1 may be the voice of the user saying "play Song 1". The execution intent list contains the execution intent "play Song 1". When voice 1 is detected, the voice assistant in the electronic device 100 is in the sleep state. The electronic device 100 may use the low-compute speech recognition model to determine that voice 1 matches the execution intent "play Song 1". Then, the electronic device 100 may perform the operation corresponding to the execution intent "play Song 1", that is, start playing Song 1.
When determining that voice 1 matches an execution intent in the execution intent list, the electronic device 100 may also wake up the voice assistant, so as to recognize the user's subsequent voices more accurately. After waking up the voice assistant, the electronic device 100 may perform step S614 above.
For steps S617 to S619 above, refer to the scenarios shown in FIG. 4A to FIG. 4C.
Case 3: (S620-S624) Voice 1 is a voice matching an extended intent.
S620. Determine whether voice 1 matches an extended intent in the extended intent list.
When determining that voice 1 does not match any execution intent in the execution intent list, the electronic device 100 may determine whether voice 1 matches an extended intent in the extended intent list. The electronic device 100 may still keep the voice assistant in the sleep state.
If determining that voice 1 matches an extended intent in the extended intent list, the electronic device 100 may perform step S621 below. Otherwise, the electronic device 100 may perform step S624 below.
S621. Prompt the user to speak execution intent 1 associated with extended intent 1, where extended intent 1 matches voice 1.
The electronic device 100 may determine that voice 1 matches extended intent 1 in the extended intent list. The electronic device 100 may determine execution intent 1 associated with extended intent 1 in the execution intent list. Then, in order to confirm whether the user speaking voice 1 is issuing a voice command, the electronic device 100 may prompt the user to speak execution intent 1.
Exemplarily, the above voice 1 may be the voice of the user saying "I am so hot". The extended intent list contains extended intent 1 "I am so hot". The execution intent list contains execution intent 1 "turn on the air conditioner". When voice 1 is detected, the voice assistant in the electronic device 100 is in the sleep state. The electronic device 100 may use the low-compute speech recognition model to determine that voice 1 matches extended intent 1. In order to confirm whether the user speaking voice 1 wants the electronic device 100 to perform the operation corresponding to execution intent 1, the electronic device 100 may prompt the user to speak execution intent 1, that is, prompt the user to say "turn on the air conditioner". For example, as shown in FIG. 5A above, the electronic device 100 may display the prompt box 421 on the screen.
S622. Detect voice 2.
The electronic device 100 may continuously collect sound from the surrounding environment through the microphone. After prompting the user to speak execution intent 1 in step S621 above, the electronic device 100 detects voice 2.
S623. Determine through the low-compute speech recognition model that voice 2 matches execution intent 1 in the execution intent list, perform the operation corresponding to execution intent 1, and wake up the voice assistant.
The voice assistant in the electronic device 100 is still in the sleep state. The electronic device 100 may determine through the low-compute speech recognition model whether voice 2 matches execution intent 1.
When determining that voice 2 matches execution intent 1, the electronic device 100 may perform the operation corresponding to execution intent 1 and wake up the voice assistant. After waking up the voice assistant, the electronic device 100 may perform step S614 above.
When determining that voice 2 does not match execution intent 1, the electronic device 100 may keep the voice assistant in the sleep state.
Here, voice 1 being "I am so hot", extended intent 1 being "I am so hot", and execution intent 1 being "turn on the air conditioner" is still used as an example. If, after step S621 above, the user says "turn on the air conditioner" according to the prompt of the electronic device 100, it can indicate that the user spoke voice 1 hoping the electronic device 100 would turn on the air conditioner. The electronic device 100 may then turn on the air conditioner after detecting the voice matching execution intent 1. This can reduce cases of missing voice commands the user may issue. If, after step S621 above, the user ignores the prompt of the electronic device 100 and does not say "turn on the air conditioner", it can indicate that the user speaking voice 1 was not issuing a voice command. The electronic device 100 then keeps the voice assistant in the sleep state and does not turn on the air conditioner. This can reduce cases of falsely responding to non-command voices spoken by the user.
S624. Keep the voice assistant in the sleep state and run the low-compute speech recognition model.
For steps S620 to S624 above, refer to the scenarios shown in FIG. 5A and FIG. 5B.
It should be noted that, in some embodiments, step S616 above is optional. For example, the voice assistant in the electronic device 100 is currently in the sleep state, and the electronic device 100 may run the low-compute speech recognition model, or run the low-compute speech recognition model and the wake-up speech recognition model. When recognizing that the detected voice 1 contains the wake word, the electronic device 100 may wake up the voice assistant (that is, step S613). When recognizing that the detected voice 1 does not contain the wake word, the electronic device 100 may use the low-compute speech recognition model to recognize whether voice 1 matches an execution intent in the execution intent list (that is, step S618).
In some embodiments, step S618 and step S620 above may be performed simultaneously. This embodiment of this application does not limit the execution order of step S618 and step S620.
As can be seen from the method shown in FIG. 6, the execution intent list and the extended intent list can reduce the impact of users' chatting, environmental noise, and the like on the electronic device 100 accurately recognizing voice commands while the voice assistant has not been woken up. This way, the electronic device 100 can respond quickly to the user's voice commands even without waking up the voice assistant. The user can issue a voice command at any time to instruct the electronic device 100 to perform the corresponding operation, without waking up the voice assistant. When recognizing that the detected voice matches an execution intent, the electronic device 100 can, in addition to performing the operation corresponding to that execution intent, wake up the voice assistant. This way, after the voice assistant is woken up, the electronic device 100 can recognize the user's subsequent requests more accurately, so as to accept the user's voice control. The electronic device 100 may switch the state of the voice assistant between the sleep state and the awake state according to the user's voice interaction with the electronic device 100. This can save the power consumption of the electronic device 100 as much as possible while bringing the user an always-on, wake-free experience.
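The dispatch logic of steps S611-S624 can be condensed into a single handler. The sketch below reuses the assumed data structures from the earlier sketches (an AssistantState instance, a set of execution intents, the extended-to-execution mapping) and an assumed wake word; it illustrates the flow rather than defining the method:

```python
def handle_voice(text: str, assistant, exec_intents: set, ext_to_exec: dict,
                 wake_word: str = "Xiaoyi Xiaoyi"):
    """Dispatch sketch for FIG. 6, under the data structures assumed above."""
    if wake_word.lower() in text.lower():          # S612 -> S613
        assistant.state = assistant.AWAKE
        return "woken by wake word"
    if assistant.state == assistant.AWAKE:         # S616 -> S614
        return f"high-compute model handles: {text}"
    if text in exec_intents:                       # S618 -> S619
        assistant.state = assistant.AWAKE
        return f"execute: {text}"
    if text in ext_to_exec:                        # S620 -> S621
        options = " / ".join(ext_to_exec[text])
        return f"confirm first, e.g. say: {options}"
    return "stay asleep"                           # S624
```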
In some embodiments, during voice interaction the electronic device 100 may further adjust the execution intent list through self-learning, so that the intents contained in the execution intent list are closer to the user's commonly used voice commands, thereby improving the user's experience of voice interaction with the electronic device 100.
FIG. 7A and FIG. 7B exemplarily show a voice interaction scenario provided by an embodiment of this application.
As shown in FIG. 7A, in the in-vehicle scenario, the voice assistant in the electronic device 100 may be in the sleep state. The user issues the voice command "play Song 2" to the electronic device 100 in the car.
The execution intent list stored by the electronic device 100 contains the execution intent "play Song 1" but not "play Song 2". That is to say, the pattern list corresponding to the execution intent list contains the pattern "play [song name]", and the entity list corresponding to the execution intent list contains "Song 1" but not "Song 2". In a possible implementation, Song 1 may be a popular song determined based on statistical data, and Song 2 may be a non-popular song determined based on statistical data. Whether a song is popular may be determined by its play-on-demand rate. The song names of popular songs may be preset in the entity list; that is, "Song 1" may be preset in the entity list while "Song 2" is not. In another possible implementation, the electronic device 100 has previously played Song 1 in response to the user's voice command, and the electronic device 100 may add "Song 1" to the entity list; the electronic device 100 has never received a voice command for playing Song 2 and therefore does not add "Song 2" to the entity list. The entity list thus contains "Song 1" but not "Song 2". In yet another possible implementation, the number of times the electronic device 100 has played Song 1 in response to the user's voice commands exceeds a preset number, while the number of times it has played Song 2 in response to the user's voice commands does not. The electronic device 100 may then add "Song 1" to the entity list but not "Song 2". This embodiment of this application does not limit the specific content contained in the entity list. Subsequent embodiments take the entity list containing "Song 1" but not "Song 2" as an example.
The electronic device 100 may detect the voice "play Song 2" in the environment. The electronic device 100 may use the low-compute speech recognition model to recognize that the pattern of the voice matches the pattern "play [song name]" in the pattern list (that is, the voice hits a pattern in the pattern list), but the entity in the voice matches none of the entities in the entity list (that is, the voice misses the entities in the entity list). Based on the above recognition result, the electronic device 100 may instruct the user to say it again, and wake up the voice assistant.
Exemplarily, the electronic device 100 may announce by voice "Sorry, I didn't catch that, please say it again". This embodiment of this application does not limit the method by which the electronic device 100 instructs the user to repeat the voice. In addition, since the voice assistant has been woken up, the electronic device 100 may display the awake indicator 412 on the user interface 710 shown in FIG. 7A.
As shown in FIG. 7B, the user says "play Song 2" again according to the instruction of the electronic device 100 shown in FIG. 7A. The electronic device 100 may detect the voice "play Song 2" in the environment. The voice assistant is currently in the awake state. The electronic device 100 may use the high-compute speech recognition model to recognize the user's intent in the voice and perform the operation corresponding to that intent. For example, the electronic device 100 may announce by voice "OK, playing Song 2 for you" and start playing Song 2. The electronic device 100 may also display the user interface 720 shown in FIG. 7B. The user interface 720 may include a voice announcement component 711 and a song playback component 712. The voice announcement component 711 may display the content announced by voice when the electronic device 100 interacts with the user. For the song playback component 712, refer to the description of the song playback component 411 shown in FIG. 4B.
In some embodiments, since the above voice "play Song 2" hit a pattern in the pattern list corresponding to the execution intent list but missed the entities in the entity list corresponding to the execution intent list, the electronic device 100 may add the entity "Song 2" in the voice to the entity list corresponding to the execution intent list. The entity "Song 2" may belong to the song-name entities in the entity list.
Then, when the electronic device 100 detects the voice "play Song 2" again while the voice assistant is in the sleep state, since the entity "Song 2" has already been added to the entity list corresponding to the execution intent list, the electronic device 100 can use the low-compute speech recognition model to determine that the voice matches an execution intent in the execution intent list, and can thus respond directly to the user's voice command and start playing Song 2 without waking up the voice assistant.
In some embodiments, while the voice assistant is in the awake state, if a voice detected by the electronic device 100 hits a pattern in the pattern list corresponding to the execution intent list but misses the entities in the entity list corresponding to the execution intent list, the electronic device 100 may add the entity in that voice to the entity list corresponding to the execution intent list.
For example, after the electronic device 100 starts playing Song 2 as shown in FIG. 7B above, the user wants to switch the song being played to Song 3. The user may speak the voice "play Song 3". Since the voice assistant is still in the awake state, the electronic device 100 may recognize the user's intent in the voice through the high-compute speech recognition model and perform the operation corresponding to that intent (that is, play Song 3). In addition, since the execution intent list does not contain the execution intent "play Song 3" (that is, the entity list corresponding to the execution intent list does not contain the entity "Song 3"), the electronic device 100 may add the entity "Song 3" in the voice to the entity list corresponding to the execution intent list. The entity "Song 3" may belong to the song-name entities in the entity list.
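Entity self-learning — adding the entity to the entity list when a voice hits a pattern but misses the entities — can be sketched as follows, with regex patterns standing in for the pattern list:

```python
import re

patterns = {re.compile(r"^play (.+)$"): "song name"}   # assumed pattern list
entities = {"song name": {"Song 1"}}                   # assumed entity list

def learn_entity(text: str) -> bool:
    """On a pattern hit with an entity miss, add the new entity (cf. S81-S85)."""
    for regex, category in patterns.items():
        m = regex.match(text)
        if m and m.group(1) not in entities[category]:
            entities[category].add(m.group(1))         # e.g. "Song 2" joins song names
            return True
    return False

learn_entity("play Song 2")
print(entities["song name"])   # {'Song 1', 'Song 2'}
```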
As can be seen from the scenarios shown in FIG. 7A and FIG. 7B, even if a voice command issued by the user does not currently match any execution intent in the execution intent list, the user can achieve voice control of the electronic device 100 under the guidance of the electronic device 100. Throughout this voice interaction, the user does not need to perform a wake-up operation. Moreover, through self-learning during voice interaction, the electronic device 100 can add execution intents matching the voice commands issued by the user to the execution intent list. This way, when the electronic device 100 subsequently detects the same voice command again, it can respond quickly to the user's voice command without waking up the voice assistant. That is to say, when the user subsequently issues the same voice command again, no wake-up operation is needed.
FIG. 8 exemplarily shows a method by which the electronic device 100 adjusts the execution intent list through self-learning according to an embodiment of this application.
As shown in FIG. 8, the execution intent list currently stored by the electronic device 100 may include the following execution intents: "close the window", "turn on the air conditioner", "turn up the system volume", "play Song 1", "play the songs of Singer 1", "navigate to Place 1". As can be seen from the classification of execution intents in the foregoing embodiments, "close the window", "turn on the air conditioner", and "turn up the system volume" are non-entity intents, while "play Song 1", "play the songs of Singer 1", and "navigate to Place 1" are entity intents. An entity intent may consist of a sentence pattern and an entity, so the execution intent list may have a corresponding pattern list and entity list. The pattern list may include the following patterns: "play [song name]", "play the songs of [singer name]", "navigate to [place name]". The entity list may include the song-name entity "Song 1", the singer-name entity "Singer 1", and the place-name entity "Place 1".
The execution intent list above does not contain the execution intent "play Song 2"; that is, the entity list does not contain the entity "Song 2".
The execution intent list shown in FIG. 8 is merely an exemplary introduction of this embodiment of this application and shall not constitute a limitation on this application.
S81. The user speaks the voice "play Song 2".
S82. Based on the execution intent list, the electronic device 100 detects that the user's voice hits a pattern but misses the entities.
The voice assistant in the electronic device 100 is in the sleep state. When detecting the voice spoken by the user in S81 above, the electronic device 100 may use the low-compute speech recognition model to determine that the pattern of the voice matches the pattern "play [song name]" in the pattern list corresponding to the execution intent list, and that the entity of the voice matches none of the entities in the entity list corresponding to the execution intent list. That is, the voice in S81 above hits a pattern but misses the entities.
S83. The electronic device 100 instructs the user to repeat the voice, wakes up the voice assistant, and runs the high-compute speech recognition model.
For the scenario of the electronic device 100 prompting the user to repeat the voice, refer to the scenario shown in FIG. 7A.
S84. The user speaks the voice "play Song 2" again.
S85. The electronic device 100 recognizes the intent of the user's voice through the high-compute speech recognition model, performs the operation corresponding to that intent (that is, plays Song 2), and adds "Song 2" to the entity list.
As shown in FIG. 8, after the above self-learning process, the entity intent "play Song 2" has been added to the execution intent list, and the entity "Song 2" has been added to the entity list corresponding to the execution intent list. The electronic device 100 can subsequently respond quickly to the user's voice command "play Song 2" while the voice assistant is in the sleep state.
In some embodiments, the electronic device 100 may cache the detected voice. When determining that the detected voice hits a pattern but misses the entities, the electronic device 100 may obtain the voice from the storage module. Then, after the voice assistant is woken up, the electronic device 100 may use the high-compute speech recognition model to recognize the user's intent in the voice and perform the operation corresponding to that intent. This method can avoid asking the user to speak the same voice command again, improving the user's experience of voice interaction with the electronic device 100.
The self-learning shown in FIG. 8 may also be called entity self-learning. It can be understood that the sentence pattern of an entity intent can support placing any entity under the same entity category at the position of the entity placeholder. The electronic device 100 can detect which entities are contained in the voice commands issued by the user while the user uses the voice assistant, and add the entities mentioned by the user to the entity list. This makes it convenient for the user to quickly perform the operations corresponding to their commonly used voice commands without performing a wake-up operation, for example, playing songs the user often listens to, navigating to places the user often goes, and so on.
It can be seen that the above method of adjusting the execution intent list through self-learning can make the intents contained in the execution intent list closer to the user's commonly used voice commands, thereby improving the user's experience of voice interaction with the electronic device.
In some embodiments, the voice commands spoken by the user may often match the extended intents in the extended intent list. As can be seen from the foregoing embodiments, after speaking a voice matching an extended intent, the user still needs to speak the execution intent associated with that extended intent according to the prompt before the electronic device 100 performs the corresponding operation. Then, when the frequency at which the user speaks a voice matching an extended intent and, prompted by the electronic device 100, speaks the execution intent associated with that extended intent exceeds a preset frequency, the electronic device 100 may move the extended intent to the execution intent list, thereby converting the extended intent into an execution intent.
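The promotion rule can be sketched as a confirmation counter per extended intent (a simplification of the frequency test); the threshold value is an assumption:

```python
from collections import Counter

PROMOTE_AFTER = 3                       # assumed preset frequency
confirmations: Counter = Counter()      # extended intent -> confirmed-command count

def on_confirmed_command(extended_intent: str,
                         exec_intents: set, ext_intents: set) -> None:
    """Count each prompt-then-confirm round; promote once frequent enough."""
    confirmations[extended_intent] += 1
    if confirmations[extended_intent] >= PROMOTE_AFTER:
        ext_intents.discard(extended_intent)
        exec_intents.add(extended_intent)   # now executes without confirmation
```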
FIG. 9 exemplarily shows a method by which the electronic device 100 adjusts the execution intent list through self-learning according to an embodiment of this application.
As shown in FIG. 9, the execution intent list currently stored by the electronic device 100 includes the following execution intents: "close the window", "turn on the air conditioner", "turn up the system volume", "play Song 1", "play the songs of Singer 1", "navigate to Place 1". As can be seen from the classification of execution intents in the foregoing embodiments, "close the window", "turn on the air conditioner", and "turn up the system volume" are non-entity intents, while "play Song 1", "play the songs of Singer 1", and "navigate to Place 1" are entity intents.
The extended intent list currently stored by the electronic device 100 includes the following extended intents: "I am so hot", "the volume is too low".
The extended intent "I am so hot" is associated with the execution intent "turn on the air conditioner". The extended intent "the volume is too low" is associated with the execution intent "turn up the system volume".
The execution intent list, the extended intent list, and the associations between execution intents and extended intents shown in FIG. 9 are merely an exemplary introduction of this embodiment of this application and shall not constitute a limitation on this application.
As can be seen from the scenarios shown in FIG. 5A and FIG. 5B, the user speaks the voice "I am so hot" near the electronic device 100. The voice assistant in the electronic device 100 is in the sleep state. The electronic device 100 may use the low-compute speech recognition model to recognize that the detected voice matches the extended intent "I am so hot". The electronic device 100 may prompt the user to speak the execution intent "turn on the air conditioner" associated with the extended intent "I am so hot". When detecting the voice "turn on the air conditioner", the electronic device 100 may call the module controlling the air conditioner to turn it on.
In some embodiments, when the frequency at which the electronic device 100 detects the user saying "I am so hot" (that is, a voice matching the extended intent "I am so hot") and then, prompted by the electronic device 100, saying "turn on the air conditioner" (that is, the execution intent associated with the extended intent "I am so hot") exceeds a preset frequency, the electronic device 100 may add the extended intent "I am so hot" to the execution intent list and remove it from the extended intent list. This way, the extended intent "I am so hot" is converted into an execution intent. When the electronic device 100 subsequently detects the voice "I am so hot" while the voice assistant is in the sleep state, it can directly call the module controlling the air conditioner to turn on the air conditioner.
As shown in FIG. 9, after the above self-learning process, the non-entity intent "I am so hot" has been added to the execution intent list, and the extended intent "I am so hot" has been removed from the extended intent list.
It can be seen that the above self-learning method can convert extended intents matched by voices the user often speaks when issuing voice commands into execution intents, so that the execution intent list contains more intents close to the user's habitual phrasings when issuing voice commands. Using the execution intent list and extended intent list obtained after the above self-learning, the electronic device 100 can better respond to voice commands the user issues directly without a wake-up operation, improving the user's experience of voice interaction with the electronic device 100.
In some embodiments, if it is found during the user's use of the voice interaction function of the electronic device 100 that one or more execution intents in the execution intent list have a relatively high misrecognition rate, these one or more execution intents may be moved to the extended intent list, thereby converting them into extended intents.
The electronic device 100 may detect the frequency at which, within a preset time period after performing the operation corresponding to an execution intent, the performed operation is withdrawn or canceled in response to a user operation. If this frequency is higher than a preset frequency, it indicates that the execution intent has a relatively high misrecognition rate, and the electronic device 100 may convert that execution intent into an extended intent. Alternatively, the operators of the voice assistant may find during testing of the low-compute speech recognition model that one or more execution intents have a relatively high misrecognition rate. Alternatively, the operators of the voice assistant may collect user feedback and determine from it that one or more execution intents have a relatively high misrecognition rate. This embodiment of this application does not limit the implementation of determining whether an execution intent has a relatively high misrecognition rate.
After converting an execution intent with a relatively high misrecognition rate into an extended intent, when detecting a voice matching that extended intent, the electronic device 100 may first confirm with the user whether a voice command is being issued. When confirming that the user is issuing a voice command, the electronic device 100 may perform the operation corresponding to that extended intent. The above method can reduce the misrecognition caused by treating non-command voices as voice commands in the scenario of issuing voice commands without waking up the voice assistant, improving the user's experience of voice interaction with the electronic device 100.
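Conversely, demotion can be driven by how often users undo an executed operation shortly afterwards. A sketch with assumed bookkeeping and an assumed threshold:

```python
undo_counts: dict = {}     # execution intent -> (executions, undos within the window)
DEMOTE_RATIO = 0.3         # assumed threshold on the undo frequency

def record(intent: str, undone: bool) -> None:
    runs, undos = undo_counts.get(intent, (0, 0))
    undo_counts[intent] = (runs + 1, undos + (1 if undone else 0))

def review(exec_intents: set, ext_intents: set) -> None:
    """Move frequently-undone execution intents to the extended intent list."""
    for intent, (runs, undos) in list(undo_counts.items()):
        if runs and undos / runs > DEMOTE_RATIO and intent in exec_intents:
            exec_intents.discard(intent)
            ext_intents.add(intent)   # future hits require confirmation first
```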
It can be understood that the user interfaces described in the embodiments of this application are merely example interfaces and do not constitute a limitation on the solutions of this application. In other embodiments, the user interfaces may adopt different interface layouts, may include more or fewer controls, and may add or remove other function options; as long as they are based on the same inventive idea provided by this application, they all fall within the protection scope of this application.
It should be noted that, as long as no contradiction or conflict arises, any feature in any embodiment of this application, or any part of any feature, may be combined, and the combined technical solutions also fall within the scope of the embodiments of this application.
The above embodiments are merely intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features therein; and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (24)

  1. A voice interaction method, applied to an electronic device, wherein the electronic device comprises a voice assistant, and the method comprises:
    receiving a first voice while the voice assistant is in a sleep state;
    determining that the first voice matches a first intent in a first list, the first list containing intents corresponding to one or more voice commands;
    performing an operation corresponding to the first intent;
    waking up the voice assistant;
    receiving a second voice while the voice assistant is in an awake state;
    recognizing a second intent in the second voice, and performing an operation corresponding to the second intent.
  2. The method according to claim 1, wherein neither the first voice nor the second voice contains a wake word for waking up the voice assistant.
  3. The method according to claim 1 or 2, wherein the determining that the first voice matches a first intent in a first list specifically comprises:
    determining, using a first speech recognition model, that the first voice matches the first intent in the first list, the first speech recognition model running while the voice assistant is in the sleep state;
    the recognizing a second intent in the second voice specifically comprises:
    recognizing the second intent in the second voice using a second speech recognition model, the second speech recognition model running while the voice assistant is in the awake state;
    wherein the size of the second speech recognition model is larger than the size of the first speech recognition model.
  4. The method according to any one of claims 1-3, wherein after the waking up the voice assistant, the method further comprises:
    switching the voice assistant from the awake state to the sleep state when no voice is received within a first time period.
  5. The method according to any one of claims 1-4, wherein the first list corresponds to a first pattern list and a first entity list, the first pattern list contains one or more sentence patterns, the first entity list contains one or more entities, and one or more intents in the first list consist of a sentence pattern in the first pattern list and an entity in the first entity list; the method further comprises:
    receiving a third voice while the voice assistant is in the sleep state;
    determining that the sentence pattern of the third voice matches a first pattern in the first pattern list, and that the first entity list has no entity matching a first entity of the third voice;
    waking up the voice assistant;
    while the voice assistant is in the awake state, recognizing a third intent in the third voice and performing an operation corresponding to the third intent, the third intent consisting of the first pattern and the first entity.
  6. The method according to claim 5, wherein the method further comprises:
    adding the first entity of the third voice to the first entity list.
  7. The method according to claim 6, wherein the method further comprises:
    receiving a fourth voice while the voice assistant is in the sleep state;
    determining that the sentence pattern of the fourth voice matches the first pattern in the first pattern list and that the entity of the fourth voice matches the first entity in the first entity list, wherein the fourth voice matches the third intent;
    performing the operation corresponding to the third intent;
    waking up the voice assistant.
  8. The method according to any one of claims 1-7, wherein the method further comprises:
    receiving a fifth voice while the voice assistant is in the sleep state;
    determining that the fifth voice matches a fourth intent in a second list, wherein an intent in the second list is associated with one or more intents in the first list, and the fourth intent is associated with a fifth intent in the first list;
    providing a first prompt, the first prompt being used to prompt the user to speak a voice matching the fifth intent.
  9. The method according to claim 8, wherein after the providing a first prompt, the method further comprises:
    receiving a sixth voice;
    determining that the sixth voice matches the fifth intent, and performing an operation corresponding to the fifth intent;
    waking up the voice assistant.
  10. The method according to claim 8, wherein after the providing a first prompt, the method further comprises:
    canceling the first prompt and keeping the voice assistant in the sleep state when no voice matching the fifth intent is received within a second time period.
  11. The method according to any one of claims 8-10, wherein the providing a first prompt specifically comprises:
    displaying text information corresponding to the fifth intent in a user interface of the electronic device, or prompting the user by voice announcement to speak a voice matching the fifth intent.
  12. The method according to any one of claims 8-11, wherein the first list comprises a sixth intent, and the method further comprises:
    determining that the misrecognition rate of the sixth intent is higher than a first threshold, removing the sixth intent from the first list, and adding the sixth intent to the second list.
  13. A voice interaction method, applied to an electronic device, wherein the electronic device comprises a voice assistant, and the method comprises:
    receiving a first voice while the voice assistant is in a sleep state;
    in response to the first voice, providing a first prompt, the first prompt being used to prompt the user to speak a first command;
    receiving a second voice, determining that the second voice matches the first command, and performing an operation corresponding to the first command.
  14. The method according to claim 13, wherein neither the first voice nor the second voice contains a wake word for waking up the voice assistant.
  15. The method according to claim 13 or 14, wherein the providing a first prompt specifically comprises:
    displaying text information corresponding to the first command in a user interface of the electronic device, or prompting the user by voice announcement to speak a voice matching the first command.
  16. The method according to any one of claims 13-15, wherein the in response to the first voice, providing a first prompt specifically comprises:
    in response to the first voice, determining, using a first speech recognition model, that the first voice is associated with the first command, the first speech recognition model running while the voice assistant is in the sleep state;
    providing the first prompt according to the association between the first voice and the first command.
  17. The method according to claim 16, wherein the determining that the second voice matches the first command specifically comprises:
    determining, using the first speech recognition model, that the second voice matches the first command.
  18. The method according to claim 16, wherein before the determining that the second voice matches the first command, the method further comprises:
    waking up the voice assistant;
    the determining that the second voice matches the first command specifically comprises:
    while the voice assistant is in an awake state, recognizing the first command in the second voice using a second speech recognition model, the second speech recognition model running while the voice assistant is in the awake state;
    wherein the size of the second speech recognition model is larger than the size of the first speech recognition model.
  19. The method according to any one of claims 13-17, wherein after the determining that the second voice matches the first command, the method further comprises:
    waking up the voice assistant;
    while the voice assistant is in an awake state, receiving a third voice, recognizing a second command in the third voice, and performing an operation corresponding to the second command.
  20. The method according to claim 18 or 19, wherein after the waking up the voice assistant, the method further comprises:
    switching the voice assistant from the awake state to the sleep state when no voice is received within a first time period.
  21. The method according to any one of claims 13-20, wherein the method further comprises:
    receiving a fourth voice while the voice assistant is in the sleep state;
    determining that the fourth voice matches a third command, and performing an operation corresponding to the third command.
  22. An electronic device, wherein the electronic device comprises: a microphone, a memory, and one or more processors, the microphone being used to collect voice, the memory being used to store a computer program, and the one or more processors being used to call the computer program to cause the electronic device to perform the method according to any one of claims 1-12 or 13-21.
  23. A computer-readable storage medium comprising instructions, wherein when the instructions are run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-12 or 13-21.
  24. A computer program product, wherein the computer program product comprises computer instructions, and when the computer instructions are run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1-12 or 13-21.

