CN109272991B - Voice interaction method, device, equipment and computer-readable storage medium

Info

Publication number
CN109272991B
CN109272991B
Authority
CN
China
Prior art keywords
user
threshold
electronic device
voice command
command
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811148245.3A
Other languages
Chinese (zh)
Other versions
CN109272991A (en)
Inventor
贺学焱 (He Xueyan)
赵科 (Zhao Ke)
欧阳能钧 (Ouyang Nengjun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Apollo Zhilian Beijing Technology Co Ltd
Original Assignee
Apollo Zhilian Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Apollo Zhilian Beijing Technology Co Ltd filed Critical Apollo Zhilian Beijing Technology Co Ltd
Priority to CN201811148245.3A
Publication of CN109272991A
Application granted
Publication of CN109272991B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063 Training
    • G10L 15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223 Execution procedure of a spoken command
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification
    • G10L 17/22 Interactive procedures; Man-machine interfaces

Abstract

Embodiments of the present disclosure provide methods, apparatuses, devices, and computer-readable storage media for voice interaction. A method of voice interaction performed at an electronic device includes, in response to receiving a first voice command from a user, identifying an identity of the user based on the first voice command. The method also includes configuring, based on the identified identity, a match threshold for matching the first voice command with a predetermined activation command. The method also includes determining whether the first voice command matches the predetermined activation command based on the match threshold. In addition, the method includes causing the electronic device to enter an active state, in which the electronic device is capable of voice interaction with the user, in response to determining that the first voice command matches the predetermined activation command. In this way, embodiments of the present disclosure can improve the wake-up rate for registered users of the electronic device while effectively reducing the false wake-up rate in noisy scenarios.

Description

Voice interaction method, device, equipment and computer-readable storage medium
Technical Field
The present disclosure relates generally to the field of speech recognition and, more particularly, to voice interaction methods, apparatus, devices, and computer-readable media.
Background
With the development of speech recognition technology, intelligent voice devices have become increasingly common in people's daily life, work, and even production processes. Examples of smart voice devices include smartphones, smart speakers, wearable devices, etc., which allow people to interact with them by voice. To save power and reduce misrecognition, a smart voice device in standby mode typically needs to first detect a specific activation command (e.g., a wake-up word) issued by a user before entering an active state that enables voice interaction with the user. This process is also referred to as "voice wake-up". Voice wake-up can be implemented at low power consumption because it only needs to detect a certain predefined wake-up word. When the device detects that the user has spoken the wake-up word, the smart voice device is activated and can then carry out normal voice interaction with the user.
Voice wake-up performance is mainly characterized by the wake-up rate and the false wake-up rate. The wake-up rate is the rate at which a wake-up word is successfully detected when it is present in a received voice command; the false wake-up rate is the rate at which a voice command containing no wake-up word is misjudged as containing one. It is generally desirable to increase the wake-up rate of a voice device while reducing its false wake-up rate, thereby improving the user experience. In conventional schemes, however, increasing the wake-up rate also entails increasing the false wake-up rate.
Disclosure of Invention
According to an example embodiment of the present disclosure, a scheme for voice interaction is provided.
In a first aspect of the disclosure, a voice interaction method performed at an electronic device is provided. The method includes, in response to receiving a first voice command from a user, identifying an identity of the user based on the first voice command. The method also includes configuring a match threshold for matching the first voice command with a predetermined activation command based on the identified identity. The method also includes determining whether the first voice command matches a predetermined activation command based on a match threshold. In addition, the method includes causing the electronic device to enter an active state in which the electronic device is capable of voice interaction with a user in response to determining that the first voice command matches a predetermined activation command.
In a second aspect of the disclosure, an apparatus for voice interaction is provided. The device includes: an identity identification module configured to identify an identity of a user based on a first voice command in response to receiving the first voice command from the user; a threshold configuration module configured to configure a matching threshold for matching the first voice command with a predetermined activation command based on the identified identity; a match determination module configured to determine whether the first voice command matches a predetermined activation command based on a match threshold; and an activation module configured to cause the electronic device to enter an activation state in response to determining that the first voice command matches a predetermined activation command, the electronic device being capable of voice interaction with a user in the activation state.
In a third aspect of the disclosure, an electronic device is provided that includes one or more processors and a memory device. The storage device is used to store one or more programs. The one or more programs, when executed by the one or more processors, cause the one or more processors to perform a method according to the first aspect of the disclosure.
In a fourth aspect of the present disclosure, a computer-readable medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the method according to the first aspect of the present disclosure.
It should be understood that the content described in this section is not intended to identify key or essential features of embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 illustrates a schematic diagram of an example environment in which embodiments of the present disclosure can be implemented;
FIG. 2 shows a flow diagram of a voice interaction method according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a method of identifying a user identity in accordance with an implementation of the present disclosure;
FIG. 4 illustrates a flow diagram of a method of configuring a match threshold based on user identity in accordance with an implementation of the present disclosure;
FIG. 5 shows a schematic block diagram of an apparatus for voice interaction in accordance with an embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of a computing device capable of implementing various embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
In describing embodiments of the present disclosure, the term "include" and its variants are to be read as open-ended, i.e., "including but not limited to". The term "based on" should be understood as "based at least in part on". The term "one embodiment" or "the embodiment" should be understood as "at least one embodiment". The terms "first", "second", and the like may refer to different or the same objects. Other explicit and implicit definitions may also be included below.
As mentioned above, for the purpose of saving power and reducing misrecognition, a smart voice device in standby mode typically needs to first detect a specific wake-up word uttered by a user before entering an active state in which voice interaction with the user is possible. When it is detected that the user speaks the wake-up word, the smart voice device may be activated, enabling normal voice interaction with the user.
To achieve voice wake-up, some conventional schemes typically record a training audio data set for a predetermined wake-up word and then use that data set to train an acoustic model of the wake-up word's pronunciation. The acoustic model can be used to determine a pronunciation similarity score between an input voice command and the predetermined wake-up word. If the similarity score exceeds a predetermined match threshold, the wake-up is determined to be successful (i.e., the wake-up word is detected); otherwise, a wake-up failure is determined (i.e., no wake-up word is detected).
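To make this conventional flow concrete, the following is a minimal sketch of fixed-threshold wake-up detection. The placeholder `score_pronunciation` function and the threshold value are assumptions of this sketch, not taken from the patent; a real system would run a trained acoustic model in its place.

```python
# Minimal sketch of conventional fixed-threshold wake-up detection.
# score_pronunciation() is a placeholder for a trained acoustic model
# returning a pronunciation-similarity score in [0, 1].

FIXED_MATCH_THRESHOLD = 0.80  # same threshold for every speaker (illustrative)


def score_pronunciation(voice_command: bytes, wake_word: str) -> float:
    """Placeholder: a real system would run its acoustic model here."""
    raise NotImplementedError


def conventional_wake_up(voice_command: bytes, wake_word: str) -> bool:
    score = score_pronunciation(voice_command, wake_word)
    # Lowering FIXED_MATCH_THRESHOLD raises the wake-up rate, but since
    # the same threshold applies to all speakers and to background noise,
    # it also raises the false wake-up rate.
    return score >= FIXED_MATCH_THRESHOLD
```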
In these schemes, two ways are generally used to increase the wake-up rate. One is to collect as much training audio data as possible for acoustic model training, thereby improving the coverage of the acoustic model; the other is to lower the matching threshold used to determine whether the wake-up succeeded, so that more similarity scores exceed the threshold. The first approach significantly increases the training cost of the acoustic model, while the second necessarily increases the false wake-up rate along with the wake-up rate. Furthermore, such pronunciation-similarity-based matching does not distinguish well among human voices, animal sounds, environmental noise, and machine-synthesized speech, and is therefore likely to cause a high false wake-up rate in relatively noisy environments.
According to embodiments of the present disclosure, a voice interaction scheme is presented. The scheme extracts voiceprint information from a user's voice command and identifies the user's identity based on the extracted voiceprint information. The scheme further configures, in accordance with the identified identity, a matching threshold for matching the voice command with a predetermined activation command, where the matching threshold for registered users is set lower than that for non-registered users. In this way, embodiments of the present disclosure can improve the wake-up rate for registered users while effectively reducing the false wake-up rate in noisy scenarios.
Embodiments of the present disclosure will be described below in detail with reference to the accompanying drawings.
FIG. 1 illustrates a schematic diagram of an example environment 100 in which embodiments of the present disclosure can be implemented. The environment 100 may generally include an electronic device 110 and a user 120. Examples of electronic device 110 may include, but are not limited to, a smartphone, a smart speaker, a wearable device, etc., capable of voice interaction with a user. It should be understood that the description of the structure and function of environment 100 is for exemplary purposes only and does not imply any limitation as to the scope of the disclosure. Embodiments of the present disclosure may also be applied to environments with different structures and/or functions.
As shown in FIG. 1, the electronic device 110 may include, for example, a voice capture device 111 and a speech processing apparatus 112. Examples of the voice capture device 111 may include, but are not limited to, various microphones or microphone arrays, and the like. The voice capture device 111 may capture voice commands from the user 120 and pass the captured data to the speech processing apparatus 112 for processing. For example, when the electronic device 110 is in a standby mode (also referred to as an "inactive state"), the speech processing apparatus 112 may determine whether a voice command from the user 120 matches a predetermined activation command. The predetermined activation command described herein may be a command including a predetermined wake-up word, or the predetermined wake-up word itself. Examples of wake-up words include "Siri", "Hello Xiaodu", etc. When the speech processing apparatus 112 determines that the voice command from the user 120 matches the predetermined activation command, the electronic device 110 may be woken up to enter an active state. When the electronic device 110 is in the active state, in response to receiving a subsequent voice command from the user via the voice capture device 111, the speech processing apparatus 112 may recognize the voice command and perform a corresponding operation, such as an information query or music playback, based on the recognition result.
The process performed at the electronic device 110 will be described in detail below with reference to fig. 2. Fig. 2 shows a flowchart of an example method 200 performed at the electronic device 110, in accordance with an embodiment of the present disclosure. For example, the method 200 may be performed by the speech processing apparatus 112 in the electronic device 110. The various actions of method 200 are described in detail below in conjunction with fig. 1. It is to be understood that method 200 may also include additional acts not shown and/or may omit acts shown. The scope of the present disclosure is not limited in this respect.
At block 210, in response to receiving the first voice command from the user 120, the speech processing apparatus 112 identifies the identity of the user 120 based on the first voice command. The first voice command may be, for example, a voice command containing a predetermined wake-up word, which the user 120 speaks in order to activate the electronic device 110 for voice interaction. In some embodiments, the speech processing apparatus 112 may determine whether the electronic device 110 is in an inactive state, in which the electronic device 110 is not capable of voice interaction with the user 120. When the speech processing apparatus 112 determines that the electronic device 110 is in the inactive state and receives the first voice command from the voice capture device 111, the speech processing apparatus 112 may identify the identity of the user 120 based on the first voice command.
Additionally or alternatively, in some embodiments, the speech processing apparatus 112 may identify the identity of the user 120 based on the voiceprint information in the first voice command. As an example, fig. 3 illustrates a flow diagram of an example method 300 for identifying a user identity in accordance with an implementation of the present disclosure. For example, method 300 may be implemented as an example of block 210.
At block 310, the speech processing apparatus 112 extracts first voiceprint information from the first voice command. The first voiceprint information may include, for example, a sound-wave spectrum extracted from the first voice command that carries information specific to the user 120. Studies have shown that a person's voiceprint is not only distinctive but also stable: once a person reaches adulthood, his or her voiceprint generally remains relatively stable for a long time, and no matter how deliberately another person imitates that person's voice and tone, the two voiceprints remain different. A voiceprint can therefore be used to identify the speaker. In some embodiments, the speech processing apparatus 112 may utilize any known or later developed technique to extract, from the first voice command, first voiceprint information capable of identifying the user 120.
At block 320, the speech processing apparatus 112 obtains second voiceprint information of a registered user of the electronic device 110. The registered user described herein may be a legitimate user who has previously registered with the electronic device 110. In some embodiments, the second voiceprint information of the registered user may be pre-stored in a storage device coupled to the electronic device 110, from which the speech processing apparatus 112 can acquire it. Alternatively, in some embodiments, voice information of the registered user may be pre-stored in a storage device coupled to the electronic device 110; the speech processing apparatus 112 may acquire the voice information from the storage device and extract the second voiceprint information from it (e.g., in a manner similar to the extraction of the first voiceprint information).
At block 330, the speech processing device 112 determines a voiceprint similarity between the first voiceprint information of the user 120 and the second voiceprint information of the registered user. Then, at block 340, the speech processing device 112 may compare the determined voiceprint similarity to a predetermined threshold. When the voiceprint similarity exceeds a predetermined threshold, the speech processing apparatus 112 may identify the user 120 as a registered user at block 350.
In some embodiments, the electronic device 110 may have multiple registered users. For example, voiceprint information for multiple users may be pre-stored at the electronic device 110 (e.g., in a storage device coupled to the electronic device 110). In this case, the speech processing apparatus 112 may perform the method 300 for the voiceprint information of each of the multiple registered users. When the speech processing apparatus 112 determines that the first voiceprint information extracted from the first voice command matches the voiceprint information of any of the registered users (e.g., the voiceprint similarity exceeds the predetermined threshold), it can identify the user 120 as a registered user.
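As a hedged illustration of method 300, the sketch below identifies the speaker by comparing a probe voiceprint against the voiceprints of all enrolled users. The embedding itself is a crude stand-in (an averaged, L2-normalized magnitude spectrum); a production system would use a trained speaker-embedding model, and the function names and the 0.85 threshold are assumptions of this sketch, not values from the patent.

```python
from typing import Optional

import numpy as np

VOICEPRINT_SIMILARITY_THRESHOLD = 0.85  # "predetermined threshold" of block 340


def extract_voiceprint(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Block 310 stand-in: average magnitude spectrum over fixed frames,
    L2-normalized. Real systems use trained speaker-embedding models."""
    n = len(samples) // frame * frame
    frames = samples[:n].reshape(-1, frame)
    emb = np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0)
    return emb / (np.linalg.norm(emb) + 1e-9)


def identify_user(samples: np.ndarray,
                  enrolled: dict[str, np.ndarray]) -> Optional[str]:
    """Blocks 310-350: return the name of the matching registered user,
    or None if no enrolled voiceprint is similar enough."""
    probe = extract_voiceprint(samples)
    for name, reference in enrolled.items():              # block 320: pre-stored
        similarity = float(np.dot(probe, reference))      # block 330 (cosine)
        if similarity > VOICEPRINT_SIMILARITY_THRESHOLD:  # blocks 340-350
            return name
    return None
```

Note that the enrolled reference voiceprints must have been produced by the same extractor, so that the cosine similarity is meaningful.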
Returning to FIG. 2, the method 200 proceeds to block 220, where the speech processing apparatus 112 configures, based on the identified identity of the user 120, a match threshold for matching the first voice command with the predetermined activation command. The match threshold will be used to determine whether the first voice command matches the predetermined activation command. As discussed previously, the level of the match threshold determines the sensitivity of voice wake-up: the lower the match threshold, the more voice commands will be determined to match the predetermined activation command, resulting in an increased wake-up rate.
Fig. 4 illustrates a flow diagram of an example method 400 of configuring a match threshold based on a user identity in accordance with an implementation of the present disclosure. For example, the method 400 may be implemented as an example of block 220. At block 410, the speech processing apparatus 112 determines whether the user 120 is identified as a registered user. If the user 120 is identified as a registered user, the speech processing apparatus 112 may configure the match threshold as a first threshold at block 420. If the user 120 is not identified as a registered user, the speech processing apparatus 112 may configure the match threshold as a second threshold that exceeds the first threshold at block 430. In some embodiments, the first threshold and the second threshold may be predetermined match thresholds for registered and non-registered users, respectively. That is, the match threshold for registered users is set lower than that for non-registered users. In this way, embodiments of the present disclosure can effectively improve the wake-up rate for registered users. Meanwhile, because the match threshold for unregistered users is higher, the false wake-up rate in noisy scenarios can be effectively reduced; the voiceprint information of noise usually differs significantly from that of a person and is therefore not recognized as coming from a registered user.
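In code, the threshold selection of method 400 reduces to a two-way branch. The numeric values below are invented for illustration; the patent only requires that the first threshold be lower than the second.

```python
REGISTERED_MATCH_THRESHOLD = 0.70      # first threshold (illustrative value)
NON_REGISTERED_MATCH_THRESHOLD = 0.90  # second threshold (illustrative value)


def configure_match_threshold(is_registered_user: bool) -> float:
    """Blocks 410-430: a registered user gets the lower, more permissive
    threshold; an unknown speaker (including background noise, which
    yields no voiceprint match) must clear the stricter one."""
    if is_registered_user:
        return REGISTERED_MATCH_THRESHOLD
    return NON_REGISTERED_MATCH_THRESHOLD
```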
Returning to FIG. 2, the method 200 proceeds to block 230, where the speech processing device 112 determines whether the first voice command matches a predetermined activation command based on the configured match threshold. In some embodiments, the speech processing device 112 may determine a similarity between the first speech command and a predetermined activation command. When the similarity exceeds the configured match threshold, the speech processing device 112 may determine that the first speech command matches a predetermined activation command.
The speech processing apparatus 112 may determine the similarity between the first voice command and the predetermined activation command based on any known or later developed technique, and determine whether the first voice command matches the predetermined activation command by comparing the similarity with the configured match threshold. Several possible examples are listed below for illustrative purposes only. It should be understood that these examples are not to be construed as limiting the scope of the disclosure; embodiments of the present disclosure are applicable to various cases other than the following examples.
In some embodiments, the speech processing apparatus 112 may determine the similarity between the first voice command and the predetermined activation command based on an acoustic feature comparison. For example, the speech processing apparatus 112 may extract a first acoustic feature from the first voice command. An "acoustic feature" as described herein may include any one or any combination of syllables, utterance frequency, sound intensity, loudness, pitch, signal-to-noise ratio, harmonic-to-noise ratio, frequency perturbation, amplitude perturbation, cepstral coefficients, and the like. The extracted first acoustic feature may be represented, for example, in the form of a feature vector, and may be extracted from the first voice command using any known or later developed technique. Similarly, the speech processing apparatus 112 may obtain a corresponding acoustic feature (also referred to as a "second acoustic feature") of the predetermined activation command. In some embodiments, the speech processing apparatus 112 may similarly extract the second acoustic feature from a pre-stored predetermined activation command. Alternatively, the second acoustic feature may be extracted in advance and stored at the electronic device 110, for example in the form of feature vectors, templates, or acoustic models, so that the speech processing apparatus 112 can acquire it directly. In some embodiments, the speech processing apparatus 112 may determine the similarity between the first voice command and the predetermined activation command by comparing the first acoustic feature with the second acoustic feature.
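One hedged sketch of this feature-comparison variant follows: it uses per-frame log magnitude spectra as the acoustic feature and dynamic time warping (DTW) to compare utterances of different lengths. Both choices are assumptions of this sketch; the paragraph above deliberately leaves the concrete feature and comparison method open.

```python
import numpy as np


def acoustic_features(samples: np.ndarray, frame: int = 512) -> np.ndarray:
    """Illustrative acoustic feature: per-frame log magnitude spectra.
    A real system might use cepstral coefficients, pitch, loudness, etc."""
    n = len(samples) // frame * frame
    frames = samples[:n].reshape(-1, frame)
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))


def dtw_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """DTW distance between two feature sequences, mapped to (0, 1];
    higher means more similar."""
    na, nb = len(a), len(b)
    cost = np.full((na + 1, nb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, na + 1):
        for j in range(1, nb + 1):
            d = float(np.linalg.norm(a[i - 1] - b[j - 1]))
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return 1.0 / (1.0 + cost[na, nb] / (na + nb))


def first_command_matches(cmd: np.ndarray, activation: np.ndarray,
                          match_threshold: float) -> bool:
    """Block 230: compare the first voice command's features with the
    (pre-stored) features of the predetermined activation command."""
    sim = dtw_similarity(acoustic_features(cmd), acoustic_features(activation))
    return sim > match_threshold
```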
Alternatively, in some embodiments, the speech processing apparatus 112 may obtain an acoustic model pre-trained for the predetermined activation command (e.g., the predetermined wake-up word). The acoustic model may model basic acoustic units such as words, syllables, or phonemes in the predetermined activation command to describe its statistical properties. The speech processing apparatus 112 may input the first acoustic feature extracted from the first voice command into this acoustic model to obtain an acoustic model score. The score may reflect, for example, the pronunciation similarity between the first voice command and the predetermined activation command.
Alternatively, in other embodiments, the speech processing apparatus 112 may obtain an end-to-end recognition model pre-trained for the predetermined activation command. That is, when acoustic features extracted from a voice command are input to the recognition model, the recognition model can directly output a result indicating whether the voice command matches the predetermined activation command. Usually, a discrimination network is provided in such a recognition model. For example, the recognition model may compute, from the input acoustic features, a confidence that the voice command matches the predetermined activation command, and the discrimination network may compare this confidence with a configured confidence threshold to decide whether the voice command matches. In some embodiments, the speech processing apparatus 112 may configure the recognition model with the determined match threshold, so that the discrimination network therein determines whether the voice command matches the predetermined activation command based on that threshold.
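A skeleton of this configurable decision stage might look as follows; the `WakeWordRecognizer` class and its `confidence` method are hypothetical stand-ins for a trained end-to-end network and are not defined by the patent.

```python
class WakeWordRecognizer:
    """Hypothetical end-to-end recognition model for the predetermined
    activation command; confidence() stands in for a trained network."""

    def __init__(self, decision_threshold: float):
        # "Configure the recognition model based on the determined
        # matching threshold": the identity-dependent threshold becomes
        # the discrimination network's decision threshold.
        self.decision_threshold = decision_threshold

    def confidence(self, features) -> float:
        raise NotImplementedError  # trained discrimination network goes here

    def matches(self, features) -> bool:
        # The discrimination network compares the confidence with the
        # configured threshold to decide match / no match.
        return self.confidence(features) >= self.decision_threshold
```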
Additionally or alternatively, in other embodiments, the speech processing apparatus 112 may determine whether the first voice command matches the predetermined activation command using any other technique, such as, but not limited to, speech recognition techniques combining acoustic and language models, garbage-word (filler-model) based keyword-spotting techniques, and so on. In such cases, the configured match threshold determines the matching success rate: a lower match threshold corresponds to a higher match success rate, and a higher match threshold to a lower one. Since the match threshold for registered users is set lower than that for non-registered users at block 220, a voice command from a registered user has a higher success rate of matching the predetermined activation command, while a voice command from a non-registered user has a lower one.
At block 240, when the speech processing apparatus 112 determines that the first voice command matches the predetermined activation command, it may cause the electronic device 110 to enter the active state. In the active state, the electronic device 110 is capable of voice interaction with the user 120, e.g., responding to subsequent voice commands from the user 120.
Additionally or alternatively, when the electronic device 110 has entered the active state and does not receive a second voice command from the user 120 within a threshold time interval, the electronic device 110 reverts to the inactive state. That is, if the user 120 later desires to voice-interact with the electronic device 110 again, the user 120 needs to issue the predetermined activation command (e.g., speak the predetermined wake-up word) to cause the electronic device 110 to re-enter the active state.
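Tying the pieces together, a schematic state holder with the inactivity timeout could look like this; the 30-second interval, the `time.monotonic()` timing, and the reuse of `configure_match_threshold` from the earlier sketch are all illustrative assumptions.

```python
import time

INACTIVITY_TIMEOUT_S = 30.0  # illustrative "threshold time interval"


class VoiceDeviceState:
    def __init__(self) -> None:
        self.active = False
        self.last_command_time = 0.0

    def on_voice_command(self, is_registered: bool, match_score: float) -> None:
        now = time.monotonic()
        if self.active:
            # Active state: the command is handled directly
            # (information query, music playback, ...).
            self.last_command_time = now
            return
        # Inactive state (method 200): identity-dependent threshold,
        # then matching against the predetermined activation command.
        threshold = configure_match_threshold(is_registered)  # earlier sketch
        if match_score > threshold:
            self.active = True           # block 240: enter the active state
            self.last_command_time = now

    def tick(self) -> None:
        # Revert to the inactive state when no second voice command
        # arrives within the threshold time interval.
        if self.active and time.monotonic() - self.last_command_time > INACTIVITY_TIMEOUT_S:
            self.active = False
```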
As can be seen from the above description, the voice interaction scheme according to embodiments of the present disclosure extracts voiceprint information from a user's voice command and identifies the user's identity based on the extracted voiceprint information. The scheme further configures, in accordance with the identified identity, a matching threshold for matching the voice command with the predetermined activation command, where the matching threshold for registered users is set lower than that for non-registered users. In this way, embodiments of the present disclosure can improve the wake-up rate for registered users while effectively reducing the false wake-up rate in noisy scenarios.
Fig. 5 shows a schematic block diagram of an apparatus 500 for voice interaction according to an embodiment of the present disclosure. For example, the speech processing apparatus 112 as shown in FIG. 1 may be implemented with an apparatus 500. As shown in fig. 5, the apparatus 500 may include an identity identification module 510 configured to identify an identity of a user based on a first voice command in response to receiving the first voice command from the user. The apparatus 500 may also include a threshold configuration module 520 configured to configure a matching threshold for matching the first voice command with a predetermined activation command based on the identified identity. The apparatus 500 may also include a match determination module 530 configured to determine whether the first voice command matches a predetermined activation command based on a match threshold. The apparatus 500 may also include an activation module 540 configured to cause the electronic device to enter an activation state in which the electronic device is capable of voice interaction with a user in response to determining that the first voice command matches a predetermined activation command.
In some embodiments, the identity identification module 510 includes: a state determination unit configured to determine whether an electronic device is in an inactive state in which the electronic device is not capable of voice interaction with a user; and a first identity identification unit configured to identify the identity of the user based on the first voice command in response to the electronic device being in an inactive state and receiving the first voice command.
In some embodiments, the identity identification module 510 includes: a first voiceprint acquisition unit configured to extract first voiceprint information of the user from a first voice command; a second voiceprint acquisition unit configured to acquire second voiceprint information of a registered user of the electronic device; a voiceprint similarity determination unit configured to determine a voiceprint similarity between the first voiceprint information and the second voiceprint information; and a second identity identification unit configured to identify the user as a registered user in response to the voiceprint similarity exceeding a predetermined threshold.
In some embodiments, the second voiceprint acquisition unit is configured to acquire the second voiceprint information from a storage device coupled to the electronic device.
In some embodiments, the threshold configuration module 520 includes: a first threshold configuration unit configured to configure a matching threshold as a first threshold in response to the user being identified as a registered user; and a second threshold configuration unit configured to configure the matching threshold as a second threshold in response to the user not being identified as a registered user, wherein the first threshold is lower than the second threshold.
In some embodiments, the match determination module 530 includes: a similarity determination unit configured to determine a similarity between the first voice command and a predetermined activation command; and a match determination unit configured to determine that the first voice command matches a predetermined activation command in response to the similarity exceeding a match threshold.
In some embodiments, the similarity determination unit is further configured to: extracting a first acoustic feature from the first voice command; extracting a second acoustic feature from the predetermined activate command; and determining a similarity between the first voice command and the predetermined activate command by comparing the first acoustic feature and the second acoustic feature.
In some embodiments, the match determination module 530 includes: a model configuration unit configured to configure a recognition model for recognizing a predetermined activation command with a matching threshold so that the recognition model determines whether the voice command matches the predetermined activation command based on the matching threshold; and a model application unit configured to determine whether the first voice command matches a predetermined activation command using the configured recognition model.
In some embodiments, the apparatus 500 further includes a deactivation module configured to cause the electronic device to enter an inactive state in response to the electronic device being in the active state and not receiving a second voice command from the user within a threshold time interval, the electronic device being unable to voice interact with the user in the inactive state.
Fig. 6 illustrates a schematic block diagram of an example device 600 that can be used to implement embodiments of the present disclosure. The device 600 may be used to implement the electronic device 110 as shown in FIG. 1. As shown, the device 600 includes a Central Processing Unit (CPU) 601 that may perform various appropriate actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 602 or loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The CPU 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The processing unit 601 performs the various methods and processes described above, such as the methods 200, 300, and/or 400. For example, in some embodiments, methods 200, 300, and/or 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by CPU 601, one or more steps of methods 200, 300, and/or 400 described above may be performed. Alternatively, in other embodiments, CPU 601 may be configured to perform methods 200, 300, and/or 400 by any other suitable means (e.g., by way of firmware).
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a System on a Chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (20)

1. A voice interaction method performed at an electronic device, comprising:
in response to receiving a first voice command from a user, identifying an identity of the user based on the first voice command, the first voice command including a predetermined wake-up word for activating the electronic device;
configuring a match threshold for matching the first voice command with a predetermined activation command based on the identified identity;
determining whether the first voice command matches the predetermined activation command based on the match threshold; and
in response to determining that the first voice command matches the predetermined activation command, causing the electronic device to enter an activation state in which the electronic device is capable of voice interaction with the user.
2. The method of claim 1, wherein identifying the identity of the user comprises:
determining whether the electronic device is in an inactive state in which the electronic device is not capable of voice interaction with the user; and
in response to the electronic device being in the inactive state and receiving the first voice command, identifying the identity of the user based on the first voice command.
3. The method of claim 1, wherein identifying the identity of the user comprises:
extracting first voiceprint information of the user from the first voice command;
acquiring second voiceprint information of a registered user of the electronic equipment;
determining a voiceprint similarity between the first voiceprint information and the second voiceprint information; and
identifying the user as the registered user in response to the voiceprint similarity exceeding a predetermined threshold.
4. The method of claim 3, wherein obtaining the second voiceprint information comprises:
the second voiceprint information is obtained from a storage device coupled with the electronic device.
5. The method of claim 3, wherein configuring the match threshold comprises:
in response to the user being identified as the registered user, configuring the match threshold as a first threshold; and
in response to the user not being identified as the registered user, configuring the match threshold as a second threshold, wherein the first threshold is lower than the second threshold.
6. The method of claim 1, wherein determining whether the first voice command matches the predetermined activation command comprises:
determining a similarity between the first voice command and the predetermined activation command; and
in response to the similarity exceeding the match threshold, determining that the first voice command matches the predetermined activation command.
7. The method of claim 6, wherein determining the similarity comprises:
extracting a first acoustic feature from the first voice command;
extracting a second acoustic feature from the predetermined activation command; and
determining the similarity between the first voice command and the predetermined activate command by comparing the first acoustic feature and the second acoustic feature.
8. The method of claim 1, wherein determining whether the first voice command matches the predetermined activation command comprises:
configuring a recognition model for recognizing the predetermined activation command with the matching threshold such that the recognition model determines whether a voice command matches the predetermined activation command based on the matching threshold; and
utilizing the configured recognition model to determine whether the first voice command matches the predetermined activation command.
9. The method of claim 1, further comprising:
responsive to the electronic device being in the active state and not receiving a second voice command from the user within a threshold time interval, causing the electronic device to enter an inactive state in which the electronic device is unable to voice interact with the user.
10. An apparatus implemented at an electronic device, comprising:
an identity identification module configured to identify an identity of a user based on a first voice command in response to receiving the first voice command from the user, the first voice command including a predetermined wake word for activating the electronic device;
a threshold configuration module configured to configure a matching threshold for matching the first voice command with a predetermined activation command based on the identified identity;
a match determination module configured to determine whether the first voice command matches the predetermined activation command based on the match threshold; and
an activation module configured to cause the electronic device to enter an activation state in which the electronic device is capable of voice interaction with the user in response to determining that the first voice command matches the predetermined activation command.
11. The apparatus of claim 10, wherein the identity identification module comprises:
a state determination unit configured to determine whether the electronic device is in an inactive state in which the electronic device is not capable of voice interaction with the user; and
a first identity identification unit configured to identify the identity of the user based on the first voice command in response to the electronic device being in the inactive state and receiving the first voice command.
12. The apparatus of claim 10, wherein the identity identification module comprises:
a first voiceprint acquisition unit configured to extract first voiceprint information of the user from the first voice command;
a second voiceprint acquisition unit configured to acquire second voiceprint information of a registered user of the electronic device;
a voiceprint similarity determination unit configured to determine a voiceprint similarity between the first voiceprint information and the second voiceprint information; and
a second identity identification unit configured to identify the user as the registered user in response to the voiceprint similarity exceeding a predetermined threshold.
13. The apparatus of claim 12, wherein the second voiceprint acquisition unit is further configured to:
the second voiceprint information is obtained from a storage device coupled with the electronic device.
14. The apparatus of claim 12, wherein the threshold configuration module comprises:
a first threshold configuration unit configured to configure the matching threshold as a first threshold in response to the user being identified as the registered user; and
a second threshold configuration unit configured to configure the matching threshold as a second threshold in response to the user not being identified as the registered user, wherein the first threshold is lower than the second threshold.
15. The apparatus of claim 10, wherein the match determination module comprises:
a similarity determination unit configured to determine a similarity between the first voice command and the predetermined activation command; and
a match determination unit configured to determine that the first voice command matches the predetermined activation command in response to the similarity exceeding the match threshold.
16. The apparatus of claim 15, wherein the similarity determination unit is further configured to:
extracting a first acoustic feature from the first voice command;
extracting a second acoustic feature from the predetermined activation command; and
determining the similarity between the first voice command and the predetermined activate command by comparing the first acoustic feature and the second acoustic feature.
17. The apparatus of claim 10, wherein the match determination module comprises:
a model configuration unit configured to configure a recognition model for recognizing the predetermined activation command with the matching threshold so that the recognition model determines whether a voice command matches the predetermined activation command based on the matching threshold; and
a model application unit configured to determine whether the first voice command matches the predetermined activation command using the configured recognition model.
18. The apparatus of claim 10, further comprising:
a deactivation module configured to cause the electronic device to enter an inactive state in which the electronic device is unable to voice interact with the user, in response to the electronic device being in the active state and not receiving a second voice command from the user within a threshold time interval.
19. An electronic device, comprising:
one or more processors; and
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to carry out the method according to any one of claims 1-9.
20. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
CN201811148245.3A 2018-09-29 2018-09-29 Voice interaction method, device, equipment and computer-readable storage medium Active CN109272991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811148245.3A CN109272991B (en) 2018-09-29 2018-09-29 Voice interaction method, device, equipment and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811148245.3A CN109272991B (en) 2018-09-29 2018-09-29 Voice interaction method, device, equipment and computer-readable storage medium

Publications (2)

Publication Number Publication Date
CN109272991A CN109272991A (en) 2019-01-25
CN109272991B (en) 2021-11-02

Family

ID=65194800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811148245.3A Active CN109272991B (en) 2018-09-29 2018-09-29 Voice interaction method, device, equipment and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN109272991B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109977317B (en) * 2019-04-03 2021-04-27 恒生电子股份有限公司 Data query method and device
CN110335315B (en) * 2019-06-27 2021-11-02 Oppo广东移动通信有限公司 Image processing method and device and computer readable storage medium
CN110364178B (en) * 2019-07-22 2021-09-10 出门问问(苏州)信息科技有限公司 Voice processing method and device, storage medium and electronic equipment
RU2767962C2 (en) 2020-04-13 2022-03-22 Общество С Ограниченной Ответственностью «Яндекс» Method and system for recognizing replayed speech fragment
CN111833874B (en) * 2020-07-10 2023-12-05 上海茂声智能科技有限公司 Man-machine interaction method, system, equipment and storage medium based on identifier
CN112951243A (en) * 2021-02-07 2021-06-11 深圳市汇顶科技股份有限公司 Voice awakening method, device, chip, electronic equipment and storage medium
US11915711B2 (en) 2021-07-20 2024-02-27 Direct Cursus Technology L.L.C Method and system for augmenting audio signals

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838991A (en) * 2014-02-20 2014-06-04 联想(北京)有限公司 Information processing method and electronic device
CN106295672A (en) * 2015-06-12 2017-01-04 中国移动(深圳)有限公司 A kind of face identification method and device
CN107895578A (en) * 2017-11-15 2018-04-10 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN108537917A (en) * 2018-02-07 2018-09-14 青岛海尔智能家电科技有限公司 Identification success rate improvement method and intelligent door lock, doorway machine and server

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101441869A (en) * 2007-11-21 2009-05-27 联想(北京)有限公司 Method and terminal for speech recognition of terminal user identification
CN104021790A (en) * 2013-02-28 2014-09-03 联想(北京)有限公司 Sound control unlocking method and electronic device
CN106531172B (en) * 2016-11-23 2019-06-14 湖北大学 Speaker's audio playback discrimination method and system based on ambient noise variation detection
US10360916B2 (en) * 2017-02-22 2019-07-23 Plantronics, Inc. Enhanced voiceprint authentication
CN107799120A (en) * 2017-11-10 2018-03-13 北京康力优蓝机器人科技有限公司 Service robot identifies awakening method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103838991A (en) * 2014-02-20 2014-06-04 联想(北京)有限公司 Information processing method and electronic device
CN106295672A (en) * 2015-06-12 2017-01-04 中国移动(深圳)有限公司 A kind of face identification method and device
CN107895578A (en) * 2017-11-15 2018-04-10 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN108537917A (en) * 2018-02-07 2018-09-14 青岛海尔智能家电科技有限公司 Identification success rate improvement method and intelligent door lock, doorway machine and server

Also Published As

Publication number Publication date
CN109272991A (en) 2019-01-25

Similar Documents

Publication Publication Date Title
CN109272991B (en) Voice interaction method, device, equipment and computer-readable storage medium
US9775113B2 (en) Voice wakeup detecting device with digital microphone and associated method
US9779725B2 (en) Voice wakeup detecting device and method
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US9354687B2 (en) Methods and apparatus for unsupervised wakeup with time-correlated acoustic events
US20170256270A1 (en) Voice Recognition Accuracy in High Noise Conditions
CN110021307B (en) Audio verification method and device, storage medium and electronic equipment
US20140200890A1 (en) Methods, systems, and circuits for speaker dependent voice recognition with a single lexicon
US9335966B2 (en) Methods and apparatus for unsupervised wakeup
CN111341325A (en) Voiceprint recognition method and device, storage medium and electronic device
CN108711429B (en) Electronic device and device control method
US11308946B2 (en) Methods and apparatus for ASR with embedded noise reduction
US20230401338A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN110223687B (en) Instruction execution method and device, storage medium and electronic equipment
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
US11437022B2 (en) Performing speaker change detection and speaker recognition on a trigger phrase
CN110689887B (en) Audio verification method and device, storage medium and electronic equipment
CN109065026B (en) Recording control method and device
EP3195314B1 (en) Methods and apparatus for unsupervised wakeup
TW202029181A (en) Method and apparatus for specific user to wake up by speech recognition
WO2021169711A1 (en) Instruction execution method and apparatus, storage medium, and electronic device
CN112908310A (en) Voice instruction recognition method and system in intelligent electric appliance
CN114141272A (en) Sound event detection system and method
CN115691478A (en) Voice wake-up method and device, man-machine interaction equipment and storage medium
CN117496962A (en) Voice wakeup method, device, equipment and storage medium

Legal Events

Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
    Effective date of registration: 20211013
    Applicant after: Apollo Zhilian (Beijing) Technology Co.,Ltd.
    Address after: 100176 101, floor 1, building 1, yard 7, Ruihe West 2nd Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing
    Applicant before: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) Co.,Ltd.
    Address before: 100080 No.10, Shangdi 10th Street, Haidian District, Beijing
GR01 Patent grant