CN111862963A - Voice wake-up method, device and equipment - Google Patents

Voice wake-up method, device and equipment

Publication number: CN111862963A (granted as CN111862963B)
Application number: CN201910295356.5A
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 陈梦喆, 雷鸣
Assignee: Alibaba Group Holding Ltd (application filed by Alibaba Group Holding Ltd)
Legal status: Granted; active

Classifications

    • G — Physics
    • G10 — Musical instruments; Acoustics
    • G10L — Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 15/00 — Speech recognition
    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/28 — Constructional details of speech recognition systems
    • G10L 2015/223 — Execution procedure of a spoken command


Abstract

The embodiment of the invention provides a voice wake-up method, device and equipment, wherein the method comprises: receiving a voice spoken by a user; determining, through a wake-up system, a first score for the voice on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword defined by the user for a target object; determining, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score for identifying whether the voice is a wake-up voice; and, if the first score is greater than or equal to the second score, waking up the target object. In this scheme, the wake-up system recognizes the wake-up keyword, i.e., the wake-up voice, at the finer granularity of phoneme units: whether a piece of voice is a wake-up voice is determined against phoneme-level thresholds. The wake-up system therefore has greater universality and can accurately recognize the various wake-up keywords defined by different users.

Description

Voice wake-up method, device and equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a voice wake-up method, apparatus, and device.
Background
With the continuous development of artificial intelligence, human-computer interaction has become multi-modal, and the voice interaction mode in particular is supported by many products, where a product may be a device or an application. Taking a device that supports voice interaction as an example, a common interaction pattern today is: when a user wants to use the device, the user first speaks a wake-up voice to wake it up, so that the device switches from the sleep state to the working state, and then carries out normal voice interaction with the device.
The wake-up voice is a voice that at least contains a wake-up keyword. For example, if the wake-up keyword is "pan you ' y", then a voice in which the user speaks "pan you ' y" is considered a wake-up voice, and the corresponding device switches from the sleep state to the working state.
At present, for a given device, the wake-up keyword is often fixed in advance by the developer. To recognize whether the speech spoken by a user is a wake-up speech for the device, a wake-up system (also called a wake-up model) may be trained to perform this recognition. When the wake-up keyword is fixed, it suffices to collect a large amount of voice containing that keyword to train the wake-up system, and the trained system then generally performs well on that keyword, i.e., it can accurately judge whether an input voice is a wake-up voice containing the keyword. However, if the wake-up keyword changes, the performance of the wake-up system can no longer be guaranteed, since the training sample set may not cover, or may only slightly cover, the new keyword. In practical applications, users want to customize the wake-up keyword; likewise, if the training sample set of the wake-up system contains no or only a few corpus samples with the user-defined keyword, the performance of the wake-up system degrades greatly.
Disclosure of Invention
The embodiment of the invention provides a voice wake-up method, device and equipment, used to detect in real time whether an audio stream contains the voice of a user-defined wake-up keyword.
In a first aspect, an embodiment of the present invention provides a voice wake-up method, the method comprising:
receiving a user voice;
determining, through a wake-up system, a first score for the user voice on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword defined by the user for a target object;
determining, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score for identifying whether the user voice is a wake-up voice; and
if the first score is greater than or equal to the second score, waking up the target object.
In a second aspect, an embodiment of the present invention provides a voice wake-up apparatus, comprising:
a receiving module, configured to receive a user voice;
a determining module, configured to determine, through a wake-up system, a first score for the user voice on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword defined by the user for a target object, and to determine, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score for identifying whether the user voice is a wake-up voice; and
a control module, configured to wake up the target object if the first score is greater than or equal to the second score.
In a third aspect, an embodiment of the present invention provides an electronic device comprising a processor and a memory, where the memory stores executable code which, when executed by the processor, causes the processor to implement at least the voice wake-up method of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium storing executable code which, when executed by a processor of an electronic device, causes the processor to implement at least the voice wake-up method of the first aspect.
In the embodiment of the present invention, a user may define a wake-up keyword for a target object (for example, an application or a device). Based on this user-defined wake-up keyword, the wake-up system establishes a corresponding reference decoding path, which is formed by the phonemes the wake-up keyword contains, in order. In practical application, after the user speaks a voice, the wake-up system first determines the first score of the voice on the reference decoding path, then determines, from the thresholds of the phonemes on the reference decoding path, a second score for identifying whether the voice is a wake-up voice, and decides by comparing the two scores. Specifically, if the first score is greater than or equal to the second score, the voice is determined to be a wake-up voice, i.e., to contain the wake-up keyword, and the target object is woken up. In this scheme, the wake-up system recognizes the wake-up keyword, i.e., the wake-up voice, at the finer granularity of phoneme units: whether a piece of voice is a wake-up voice is determined against phoneme-level thresholds. The wake-up system therefore has greater universality and can accurately recognize the various wake-up keywords defined by different users.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a flow chart of a voice wake-up method according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for determining a phoneme corresponding threshold according to an exemplary embodiment;
FIG. 3 is a schematic diagram of a phoneme threshold determination process provided by an exemplary embodiment;
FIG. 4 is a flow chart of a method of determining first target coordinates provided by an exemplary embodiment;
FIG. 5 is a schematic diagram of a phoneme threshold determination process provided by an exemplary embodiment;
FIG. 6 is a schematic structural diagram of a voice wake-up apparatus according to an exemplary embodiment;
FIG. 7 is a schematic structural diagram of an electronic device corresponding to the voice wake-up apparatus provided in the embodiment shown in FIG. 6.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well. "plurality" generally includes at least two unless the context clearly dictates otherwise.
Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", "in response to determining", or "in response to detecting". Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined", "in response to determining", "when detected (a stated condition or event)", or "in response to detecting (a stated condition or event)", depending on the context.
It is also noted that the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such product or system. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in a product or system that includes that element.
In addition, the sequence of steps in each method embodiment described below is only an example and is not strictly limited.
Fig. 1 is a flowchart of a voice wake-up method according to an exemplary embodiment, as shown in fig. 1, the method includes the following steps:
101. Receive a user voice.
102. Determine, through the wake-up system, a first score for the user voice on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword defined by the user for a target object.
103. Determine, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score for identifying whether the user voice is a wake-up voice.
104. If the first score is greater than or equal to the second score, wake up the target object.
The voice wake-up method provided herein may be executed by an electronic device, which may be a terminal device such as a PC or a notebook computer, or a server. The server may be a physical server with an independent host, a virtual server carried on a host cluster, or a cloud server.
In the embodiment of the present invention, a user may customize a wake-up keyword for a target object (for example, an application or a device); this user-defined wake-up keyword is referred to as the first wake-up keyword. In practical applications, the target object may provide a user interface for setting the first wake-up keyword, for example, an input box through which the keyword is entered. After the user sets the first wake-up keyword, whenever the user later wants to use the target object, the user first speaks the wake-up voice corresponding to the first wake-up keyword to switch the target object from the sleep state to the working state.
Since different users may define different wake-up keywords for the same target object, and in order to allow each user to wake up the target object with their own keyword, the embodiment of the present invention provides a wake-up system (or wake-up model) that can accurately recognize arbitrary wake-up keywords, i.e., a wake-up system with universality. To achieve this universality, the wake-up system recognizes wake-up keywords, i.e., wake-up voices, at the finer granularity of phoneme units, so that even a wake-up keyword the system has never learned can be recognized on the basis of its phoneme units. A phoneme unit may be a monophone unit or a polyphone unit such as a triphone unit.
For example, in response to the user's operation of setting the first wake-up keyword, the wake-up system may determine the phoneme sequence corresponding to the first wake-up keyword according to the word-to-phoneme correspondence described in a dictionary, and form the first reference decoding path from that phoneme sequence. When the first wake-up keyword consists of multiple words, the phonemes of those words are arranged in order to form the first reference decoding path. The first reference decoding path can be understood as follows: when the user actually speaks the first wake-up keyword, the wake-up system, while decoding the voice, should in theory decode each phoneme on the first reference decoding path in turn.
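As a rough sketch of this path construction, the following concatenates per-word phoneme sequences looked up in a pronunciation dictionary. The lexicon, words, and phoneme labels here are all made up for illustration and are not from the patent.

```python
def build_reference_path(keyword_words, pronunciation_dict):
    """Concatenate the phonemes of each keyword word, in order, into one path."""
    path = []
    for word in keyword_words:
        if word not in pronunciation_dict:
            raise KeyError(f"word {word!r} missing from dictionary")
        path.extend(pronunciation_dict[word])
    return path

# Toy dictionary: two made-up words with made-up phoneme labels.
lexicon = {
    "hello": ["HH", "AH", "L", "OW"],
    "cloud": ["K", "L", "AW", "D"],
}

reference_path = build_reference_path(["hello", "cloud"], lexicon)
print(reference_path)  # phonemes of "hello" followed by phonemes of "cloud"
```

In a real system the lexicon would be the wake-up system's own dictionary (and possibly triphone units rather than the monophone-like labels used here).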
After the first reference decoding path is set, the user may wake up the target object by the customized first wake-up keyword in the subsequent process of using the target object, and the subsequent process of actually using the target object by the user is described below.
In practical applications, when a user speaks a voice, the wake-up system first needs to determine the first score of that voice on the reference decoding path. Internally, the wake-up system may comprise an acoustic model and a decoder. To determine the first score, the received voice may be divided into frames to obtain a plurality of audio frames, where the duration of each audio frame is chosen so that each audio frame corresponds to only one phoneme. The audio frames are then input to the acoustic model in sequence, the acoustic model predicts the phoneme probabilities of each audio frame, and the decoder determines, from these per-frame phoneme probabilities, the first score of the voice on the first reference decoding path. The first score can be understood as the probability that the content of the voice is the first wake-up keyword corresponding to the first reference decoding path.
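The first-score computation can be sketched as follows, under two simplifying assumptions: the decoder has already aligned each audio frame to one phoneme of the reference path, and scores are plain probability sums (real decoders typically accumulate log-probabilities). All names and numbers are illustrative.

```python
def first_score(frame_phoneme_probs, alignment):
    """Sum, over frames, the acoustic model's probability of the aligned phoneme."""
    return sum(probs[ph] for probs, ph in zip(frame_phoneme_probs, alignment))

# Three frames; the first two are aligned to phoneme "A", the last to "B".
# Each dict stands in for the acoustic model's output distribution for a frame.
frame_probs = [
    {"A": 0.9, "B": 0.1},
    {"A": 0.8, "B": 0.2},
    {"A": 0.2, "B": 0.8},
]
print(first_score(frame_probs, ["A", "A", "B"]))  # 0.9 + 0.8 + 0.8
```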
In addition to the determination of the first score, a second score for identifying whether the received speech is the wake-up speech (i.e., determining whether the speech is the speech corresponding to the first wake-up keyword, or whether the content included in the speech corresponds to the first wake-up keyword) is determined according to the threshold corresponding to each phoneme on the first reference decoding path.
It is understood that the recognition of whether the received speech is the wake-up speech corresponding to the first wake-up keyword is actually a classification problem, and the classification problem can be generally implemented by setting a threshold value, where the second score is equivalent to a classification threshold value for implementing the classification recognition of whether the received speech is the wake-up speech corresponding to the first wake-up keyword. It should be emphasized that, in this embodiment, the threshold unit corresponding to the second score is a phoneme, that is, the second score may be determined by the threshold corresponding to each phoneme sequentially passing through on the first reference decoding path.
That is, the second score can be regarded as the score a voice must reach to be considered to contain the wake-up keyword. It is determined by the thresholds of the phonemes on the first reference decoding path, where the threshold of each phoneme can be understood as the contribution that phoneme must make for the overall second score to be reached.
In an alternative embodiment, the second score may be determined as the sum of the thresholds of the phonemes on the first reference decoding path. Note that this second score is independent of the received voice, so the same second score applies to any received voice.
In another alternative embodiment, the second score may be determined from the threshold corresponding to each phoneme on the first reference decoding path together with the number of audio frames, among the audio frames of the received voice, that correspond to each phoneme. Specifically, the threshold of each phoneme is multiplied by the number of audio frames corresponding to that phoneme, and the products are summed to obtain the second score. For example, assume the first reference decoding path consists of phoneme A followed by phoneme B, where the threshold of phoneme A is 0.6 and the threshold of phoneme B is 0.4. Assume the received voice comprises audio frames 1 through 4, that audio frames 1, 2, and 3 all correspond to phoneme A (i.e., these three frames contain only phoneme A), and that audio frame 4 corresponds to phoneme B. The second score is then 0.6 × 3 + 0.4 × 1 = 2.2. In this alternative embodiment, the threshold for identifying whether the voice is a wake-up voice, i.e., the second score, is determined both by the phonemes forming the first wake-up keyword and by the duration spent on each phoneme during decoding (characterized by the number of audio frames), and this refined threshold determination yields better final recognition.
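The frame-count-weighted second score of this embodiment, including the 0.6 × 3 + 0.4 × 1 example above, can be sketched as:

```python
def second_score(phoneme_thresholds, frame_counts):
    """Multiply each phoneme's threshold by its aligned frame count and sum."""
    return sum(phoneme_thresholds[ph] * n for ph, n in frame_counts.items())

thresholds = {"A": 0.6, "B": 0.4}  # per-phoneme thresholds from the example
counts = {"A": 3, "B": 1}          # three frames on phoneme A, one on phoneme B
print(round(second_score(thresholds, counts), 2))  # 0.6*3 + 0.4*1 = 2.2
```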
The determination process of the threshold corresponding to the phoneme mentioned above will be explained in the following embodiments.
After the first score and the second score are obtained, whether the received voice is a wake-up voice is determined by comparing them. Specifically, when the first score is greater than or equal to the second score, the voice is determined to be a wake-up voice and the target object is woken up; otherwise, when the first score is smaller than the second score, the voice is determined not to be a wake-up voice and no response is made.
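Putting the two scores together, a minimal sketch of the wake decision might look as follows, again assuming a given frame-to-phoneme alignment and treating scores as plain probability sums; the probabilities and thresholds are illustrative.

```python
def is_wake_speech(frame_probs, alignment, phoneme_thresholds):
    # First score: acoustic evidence accumulated along the reference path.
    first = sum(probs[ph] for probs, ph in zip(frame_probs, alignment))
    # Second score: each phoneme's threshold counted once per aligned frame,
    # which equals threshold * frame-count summed over the path.
    second = sum(phoneme_thresholds[ph] for ph in alignment)
    return first >= second

frame_probs = [{"A": 0.9}, {"A": 0.8}, {"A": 0.7}, {"B": 0.5}]
alignment = ["A", "A", "A", "B"]
thresholds = {"A": 0.6, "B": 0.4}
print(is_wake_speech(frame_probs, alignment, thresholds))  # first 2.9 >= second 2.2
```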
In summary, the wake-up system recognizes the wake-up keyword, i.e., the wake-up voice, at the finer granularity of phoneme units: whether a piece of voice is a wake-up voice is determined against phoneme-level thresholds. The wake-up system therefore has greater universality and can accurately recognize the various wake-up keywords defined by different users.
The following describes a process of determining a threshold value corresponding to a phoneme.
Fig. 2 is a flowchart of a method for determining a threshold corresponding to a phoneme according to an exemplary embodiment, where as shown in fig. 2, the method may include the following steps:
201. Label the correspondence between audio frames and phonemes for the voice samples in the first voice sample set.
For a target object, the first set of speech samples may be constructed by collecting speech output by a large number of users during use of the target object.
For each voice sample in the first voice sample set, the text content corresponding to the voice sample may be manually marked in advance, and then, in combination with a dictionary which describes the correspondence between words and phonemes and is generated in advance, the text content corresponding to the voice sample is converted into a phoneme sequence.
In addition, the first speech sample set may be used to determine a threshold corresponding to each phoneme, and may also be used to train an acoustic model included in the wake-up system.
Whether the method is used for training the acoustic model or determining the threshold corresponding to each phoneme, each voice sample may be subjected to certain preprocessing, and the preprocessing process may at least include framing each voice sample, so as to obtain a one-to-one correspondence relationship between each audio frame and each phoneme.
The process of training the acoustic model based on the first speech sample set is consistent with the training process of the existing acoustic model, and is not repeated. The following only describes how to determine the threshold corresponding to each phoneme based on the first speech sample set and the trained acoustic model.
202. For any phoneme, generate a positive sample set and a negative sample set, where the positive sample set comprises the audio frames that correspond to that phoneme and the negative sample set comprises the audio frames that do not correspond to that phoneme.
Through labeling processing of each voice sample in the first voice sample set, a plurality of phonemes contained in the first voice sample set can be obtained, and determination processing of a corresponding threshold value is sequentially performed on each phoneme.
First, take any phoneme and denote it phoneme i. From the audio frames into which the voice samples in the first voice sample set are divided, the audio frames corresponding to phoneme i are selected to form the positive sample set of phoneme i, and the remaining audio frames, which do not correspond to phoneme i, form the negative sample set of phoneme i.
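A sketch of this positive/negative split, with hypothetical frame identifiers and phoneme labels:

```python
def split_samples(labeled_frames, phoneme):
    """Split (frame, phoneme_label) pairs into positive and negative sets for one phoneme."""
    positives = [f for f, label in labeled_frames if label == phoneme]
    negatives = [f for f, label in labeled_frames if label != phoneme]
    return positives, negatives

# Four labeled audio frames; labels come from step 201's annotation.
frames = [("f1", "A"), ("f2", "B"), ("f3", "A"), ("f4", "C")]
pos, neg = split_samples(frames, "A")
print(pos, neg)  # frames labeled "A" vs. all other frames
```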
203. Traverse the samples in the positive and negative sample sets, and output, through the acoustic model in the wake-up system, the prediction probability of the currently traversed sample under the phoneme.
Assuming that the currently traversed sample is a sample j, inputting the sample j into an acoustic model, and predicting the probability of the sample j corresponding to each phoneme by the acoustic model. Here, the prediction probability of the sample j output by the acoustic model under the phoneme i is extracted.
Therefore, the prediction probability of each sample in the positive sample set or the negative sample set under the phoneme i can be obtained through an acoustic model. Further, each sample may be labeled with its prediction probability under phoneme i in the positive and negative sample sets.
204. Determine, according to the prediction probabilities corresponding to the samples in the positive and negative sample sets, a first functional relationship curve reflecting the missed recognition rate and the false recognition rate of the phoneme.
205. Determine the prediction probability corresponding to a first target coordinate on the first functional relationship curve to be the threshold for the phoneme, where the first target coordinate makes the missed recognition rate and the false recognition rate meet set conditions.
After the prediction probabilities of the samples in the positive and negative sample sets under phoneme i are obtained, two indices corresponding to phoneme i, namely the missed recognition rate and the false recognition rate, can be calculated from them.
Here, denote the missed recognition rate FR and the false recognition rate FA:
FR = 1 - (number of correctly recognized samples in the positive set) / (total number of samples in the positive set)
FA = (number of misrecognized samples in the negative set) / (total number of samples in the negative set)
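The two indices can be written directly as functions; the counts below are illustrative:

```python
def missed_rate(correct_positives, total_positives):
    """FR = 1 - correctly_recognized_positives / total_positives."""
    return 1 - correct_positives / total_positives

def false_rate(misrecognized_negatives, total_negatives):
    """FA = misrecognized_negatives / total_negatives."""
    return misrecognized_negatives / total_negatives

print(missed_rate(75, 100))  # 0.25: a quarter of positives were missed
print(false_rate(10, 40))    # 0.25: a quarter of negatives were falsely accepted
```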
After the two indexes are obtained through calculation, a first functional relation curve reflecting the two indexes can be drawn in a set coordinate system, and FA can be taken as a horizontal axis and FR can be taken as a vertical axis, or vice versa.
It should be noted that calculating the missed and false recognition rates for phoneme i means calculating them under different prediction probabilities, where the different prediction probabilities are those output by the acoustic model under phoneme i for the samples in the positive and negative sample sets.
One implementation of determining the first functional relationship curve is given below:
Perform the following iterative process at least once, until the reference probability value is updated to its upper limit:
update the reference probability value;
for any sample in the positive and negative sample sets, if the sample's prediction probability is greater than the current reference probability value and the sample belongs to the positive sample set, increment the count of correctly recognized positive samples;
if the sample's prediction probability is greater than the current reference probability value and the sample belongs to the negative sample set, increment the count of misrecognized negative samples;
determine the missed recognition rate and the false recognition rate of phoneme i under the current reference probability value according to the accumulated counts of correctly recognized positive samples and misrecognized negative samples;
and determine the first functional relationship curve according to the missed and false recognition rates of phoneme i obtained under each reference probability value.
The reference probability value can be understood as a parameter variable whose values cover the value range of the prediction probability, namely 0 to 1.
In an optional embodiment, the prediction probabilities corresponding to the samples in the positive and negative sample sets may be sorted in ascending or descending order, and the reference probability value updated by traversing the sorted prediction probabilities in turn; in this case the reference probability value is successively set to each prediction probability in the sequence.
In another alternative embodiment, the reference probability value may also be updated in sequence from an initial value according to a set step size, for example, the initial value is 0, the step size is 0.1, the reference probability value is updated to 0.1 for the second time, to 0.2 for the third time, and so on until the update is 1.
The above iterative process is illustrated below. Assume the initial reference probability value is a and the update step is 0.1. For any sample in the positive and negative sample sets, call it sample j: if sample j's prediction probability is greater than a and sample j belongs to the positive sample set, the count of correctly recognized positive samples is incremented, since sample j is a positive sample and is recognized as one; if sample j's prediction probability is greater than a but sample j belongs to the negative sample set, the count of misrecognized negative samples is incremented, since sample j is a negative sample misrecognized as positive. After all samples in both sets have been judged this way, the accumulated count of correctly recognized positive samples (say C1) and the accumulated count of misrecognized negative samples (say C2) are obtained. Then, with the reference probability value at a, one pair of FA and FR values can be computed: FR = 1 - C1/Z1 and FA = C2/Z2, where Z1 and Z2 are the total numbers of samples in the positive and negative sample sets, respectively.
Next, the reference probability value is updated to a + 0.1, and the above determination process is repeated to obtain the pair of FA and FR values corresponding to a + 0.1. The iteration continues in this way until the reference probability value is updated to an upper limit value, assumed to be b, at which point the pair of FA and FR values corresponding to b is obtained.
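The sweep over reference probability values described above can be sketched in Python. This is a hypothetical illustration, not code from the patent; the function name, the toy probability lists, and the tuple layout are all assumptions.

```python
def fa_fr_curve(pos_probs, neg_probs, step=0.1):
    """Sweep the reference probability value from 0 to 1 in `step` increments
    and record one (reference value, FA, FR) triple per iteration, where
    FA = C2/Z2 and FR = 1 - C1/Z1 as in the text."""
    z1, z2 = len(pos_probs), len(neg_probs)
    points = []
    n_steps = int(round(1.0 / step))
    for k in range(n_steps + 1):
        a = k * step  # current reference probability value
        c1 = sum(1 for p in pos_probs if p > a)  # positives recognized as positive
        c2 = sum(1 for p in neg_probs if p > a)  # negatives mistaken for positive
        points.append((a, c2 / z2, 1 - c1 / z1))
    return points

# Toy prediction probabilities for one phoneme's positive / negative sample sets.
curve = fa_fr_curve([0.9, 0.8, 0.4], [0.2, 0.1, 0.05])
```

At the reference value 0 every sample clears the bar (FA = 1, FR = 0); as the reference value rises, FA falls and FR climbs, tracing out the first functional relationship curve.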
After each pair of FA and FR values corresponding to each reference probability value from a to b is obtained, the coordinate points corresponding to these pairs can be plotted in a coordinate system and a curve drawn through them; this curve is the first functional relationship curve. Each coordinate point on the first functional relationship curve thus corresponds to one reference probability value, i.e., one prediction probability.
After the first functional relation curve is obtained, it may be determined that the prediction probability corresponding to a first target coordinate on the first functional relation curve is a threshold corresponding to the phoneme i, where the first target coordinate makes the missing recognition rate and the false recognition rate meet set conditions.
Optionally, the set condition may be, for example, selecting the coordinate with the smallest FA or the smallest FR as the first target coordinate, or randomly selecting one of several coordinates at which FA is below one set value and FR is below another. The purpose of the set condition is to strike a compromise between the missed recognition rate and the false recognition rate.
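One concrete instance of such a set condition, shown as a hypothetical Python sketch (the patent leaves the exact trade-off rule open), is to pick the curve point minimizing the sum FA + FR:

```python
def pick_threshold(curve_points):
    """curve_points: (prediction probability, FA, FR) triples on one phoneme's
    first functional relationship curve. Return the prediction probability of
    the point minimizing FA + FR, one possible compromise rule."""
    best = min(curve_points, key=lambda t: t[1] + t[2])
    return best[0]

# Illustrative curve points; the middle one balances FA and FR best.
threshold = pick_threshold([(0.2, 0.5, 0.1), (0.5, 0.2, 0.2), (0.8, 0.05, 0.6)])
```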
The above describes the process of determining the threshold for any phoneme i as an example; the same process applies to the other phonemes, and a first functional relationship curve can be drawn for each phoneme in a coordinate system, so that the threshold corresponding to each phoneme is obtained.
As shown in fig. 3, fig. 3 illustrates the first functional relationship curves corresponding to two phonemes, phoneme i and phoneme j, and marks the first target coordinate of each: (FA1, FR1) and (FA2, FR2). Assume the prediction probabilities corresponding to these two coordinates are P1 and P2, so the thresholds of phoneme i and phoneme j are P1 and P2, respectively. In addition, assuming that the aforementioned first reference decoding path contains phoneme i followed by phoneme j, in an alternative embodiment the aforementioned second score is P1 × P2.
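As a minimal sketch of the product-form second score (hypothetical Python; the function name and threshold values are illustrative, not from the patent):

```python
from functools import reduce
import operator

def second_score(phoneme_thresholds):
    """Product of the per-phoneme thresholds along the reference decoding
    path, e.g. P1 * P2 for a path containing phoneme i then phoneme j."""
    return reduce(operator.mul, phoneme_thresholds, 1.0)

score = second_score([0.5, 0.4])  # illustrative thresholds P1 = 0.5, P2 = 0.4
```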
For the first functional relationship curve corresponding to any phoneme i, several ways of determining the first target coordinate on the first functional relationship curve are described in the foregoing embodiments, and another way of determining the first target coordinate is described below.
Fig. 4 is a flowchart of a method for determining the first target coordinate according to an exemplary embodiment. As shown in fig. 4, the method may include the following steps:
401. Label each voice sample in the second voice sample set as positive or negative, wherein voice samples containing the second wake-up keyword are labeled positive and voice samples not containing the second wake-up keyword are labeled negative.
402. Determine, through the wake-up system, a third score for each voice sample in the second voice sample set on the second reference decoding path corresponding to the second wake-up keyword.
403. Determine the intersection point between a target straight line and the first functional relationship curve of each phoneme, wherein the target straight line passes through the coordinate origin and forms a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relationship curves lie.
404. Determine a fourth score for identifying whether a voice sample is wake-up speech according to the threshold currently corresponding to each phoneme on the second reference decoding path, wherein each phoneme's current threshold is the prediction probability corresponding to the intersection point on its first functional relationship curve.
405. For each voice sample in the second voice sample set, determine the false wake-up rate and missed wake-up rate of the target angle threshold according to the wake-up speech recognition result obtained by comparing the sample's third and fourth scores and the sample's positive/negative label.
406. If the false wake-up rate and missed wake-up rate of the target angle threshold meet set conditions, determine the intersection point between the target straight line and each phoneme's first functional relationship curve as the first target coordinate on the corresponding first functional relationship curve.
In this embodiment, the second speech sample set is used to assist in determining the threshold corresponding to each phoneme (i.e., determining the first target coordinate on each phoneme's first functional relationship curve), and may also be used to test the performance of the wake-up system. Because training on the first speech sample set in the embodiment shown in fig. 2 has been completed, the acoustic model in the wake-up system can be tested to check whether the training result is reliable.
For the target object, a tester may define a wake-up keyword as the second wake-up keyword, collect speech containing the second wake-up keyword as positive voice samples, and collect other speech not containing it as negative voice samples, together forming the second voice sample set.
For example, assume the second wake-up keyword is "fairy baby". Then any speech containing the consecutive phrase "fairy baby", such as "fairy baby", "open, fairy baby", or "fairy baby, hello", is regarded as a positive voice sample, and all other speech samples are negative voice samples.
As described in the foregoing embodiment, in response to the setting operation of the second wake-up keyword by the tester, the wake-up system may generate a second reference decoding path corresponding to the second wake-up keyword, where the second reference decoding path is formed by a phoneme sequence included in the second wake-up keyword.
After the second voice sample set is obtained, all voice samples in the second voice sample set are sequentially input into the awakening system, and third scores, corresponding to all voice samples in the second voice sample set, on a second reference decoding path corresponding to the second awakening keyword are determined through the awakening system. The process of determining the third score is the same as the process of determining the first score in the foregoing embodiment, and is not described herein again.
In this embodiment, the core idea of finding the first target coordinate on each phoneme's first functional relationship curve is as follows. Suppose a target angle threshold can be determined in the coordinate system in which the first functional relationship curves lie; then a target straight line that passes through the coordinate origin and forms the target angle threshold with a preset coordinate axis of the coordinate system, such as the horizontal axis, will intersect each first functional relationship curve. If the intersection point on each curve is taken as the first target coordinate of the corresponding phoneme, the prediction probabilities corresponding to these intersection points are the thresholds of the corresponding phonemes. An expected score (the fourth score below) is then determined from these thresholds and used as the basis for identifying whether each voice sample in the second voice sample set is wake-up speech, that is, whether it contains the second wake-up keyword. If the recognition result based on this expected score performs well, each intersection point can be taken as the first target coordinate on its corresponding first functional relationship curve. The performance of the recognition result can be judged by two indicators: the missed wake-up rate and the false wake-up rate. Denoting the missed wake-up rate as FR' and the false wake-up rate as FA':
FR' = 1 - (number of correctly awakened positive voice samples / total number of positive voice samples);
FA' = number of erroneously awakened negative voice samples / total number of negative voice samples.
Assuming that the second set of speech samples includes N1 positive speech samples and N2 negative speech samples, the total number of positive speech samples is N1 and the total number of negative speech samples is N2.
Therefore, if the first target coordinates on each first functional relationship curve are to be obtained, the corresponding target angle threshold value at which the missed wake-up rate and the false wake-up rate meet the set requirements needs to be found first.
Assume for now that the target angle threshold has been found; the threshold currently corresponding to each phoneme is then also determined, namely the prediction probability corresponding to the intersection point of the target straight line with that phoneme's first functional relationship curve. The fourth score for identifying whether a voice sample is wake-up speech is then determined from these current thresholds.
In an alternative embodiment, the fourth score may be determined as the product of the thresholds corresponding to the phonemes on the second reference decoding path. The fourth score is then independent of the individual speech samples, so the same fourth score can be applied to every voice sample in the second voice sample set.
In another alternative embodiment, for any speech sample in the second speech sample set, the fourth score for identifying whether the any speech sample is a wake-up speech may be determined according to a threshold value currently corresponding to each phoneme included in the second reference decoding path and a number of audio frames corresponding to each phoneme in a plurality of audio frames of the any speech sample. At this time, the fourth scores corresponding to each speech sample may be different.
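The patent does not pin down how the frame counts enter the score. One plausible reading, shown as a hypothetical Python sketch, raises each phoneme's threshold to the number of audio frames aligned to it, mirroring a product of per-frame probabilities; the function name and the numbers are illustrative assumptions.

```python
def fourth_score_weighted(thresholds, frame_counts):
    """One assumed frame-weighted form of the fourth score: each phoneme's
    threshold raised to the number of frames aligned to that phoneme in the
    given voice sample, multiplied together."""
    score = 1.0
    for t, n in zip(thresholds, frame_counts):
        score *= t ** n
    return score

# Two phonemes with threshold 0.5; the sample aligns 2 frames to the first
# phoneme and 1 frame to the second.
sample_score = fourth_score_weighted([0.5, 0.5], [2, 1])
```

Under this reading the fourth score differs per sample, as the text notes.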
The following describes, still assuming the target angle threshold has been found, how its missed wake-up rate and false wake-up rate are determined.
For any voice sample k in the second voice sample set: if the third score of voice sample k is greater than the fourth score and voice sample k is labeled as a positive sample, the correct wake-up count is incremented by one; if the third score of voice sample k is greater than the fourth score but voice sample k is labeled as a negative sample, the false wake-up count is incremented by one. From the accumulated correct wake-up count and false wake-up count over all voice samples in the second voice sample set, the missed wake-up rate and the false wake-up rate corresponding to the target angle threshold can then be determined.
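The score comparison and rate computation just described can be sketched as follows (hypothetical Python; the data layout and names are illustrative, not from the patent):

```python
def evaluate_angle(samples, fourth_score):
    """samples: (third score, is_positive) pairs for the second voice sample
    set. Returns (false wake-up rate FA', missed wake-up rate FR') for the
    candidate fourth score."""
    correct = wrong = 0
    n_pos = sum(1 for _, is_pos in samples if is_pos)
    n_neg = len(samples) - n_pos
    for third, is_pos in samples:
        if third > fourth_score:  # recognized as wake-up speech
            if is_pos:
                correct += 1      # correct wake-up
            else:
                wrong += 1        # false wake-up
    fa = wrong / n_neg            # FA' = false wake-ups / negatives
    fr = 1 - correct / n_pos      # FR' = 1 - correct wake-ups / positives
    return fa, fr

# Two positive and two negative samples, with a fourth score of 0.5.
fa, fr = evaluate_angle([(0.9, True), (0.3, True), (0.8, False), (0.1, False)], 0.5)
```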
How to find the target angle threshold is described next.
In the process of determining the target angle threshold, the following iterative process needs to be executed at least once, until the reference angle value is updated to its upper limit:
updating the reference angle value;
determining the missed awakening rate and the false awakening rate of the current reference angle value according to the correct awakening time accumulated value and the false awakening time accumulated value corresponding to all the voice samples in the second voice sample set under the current reference angle value;
determining a second function relation curve reflecting the missed awakening rate and the false awakening rate of each reference angle value according to the obtained missed awakening rate and the false awakening rate of each reference angle value;
and determining a second target coordinate which meets the set condition on the second functional relation curve, wherein a reference angle value corresponding to the second target coordinate is used as a target angle threshold value.
Thus, determining the target angle threshold requires multiple iterations to obtain the missed wake-up rate and false wake-up rate under different reference angle values, and a second functional relationship curve is then drawn from the resulting pairs of missed and false wake-up rates. The determination of the second functional relationship curve is therefore similar to that of the first functional relationship curve.
Specifically, taking the reference angle value as the included angle between a straight line through the coordinate origin and the horizontal axis, the reference angle value ranges from 0 to 90 degrees. The iterative process may be: starting from 0 degrees, successively update the reference angle value by a set step size, such as 5 or 10 degrees, until it reaches 90 degrees, and then search for the target angle threshold among the traversed reference angle values.
The following describes schematically a specific implementation of the above iterative process with reference to fig. 5.
Fig. 5 illustrates the first functional relationship curves corresponding to phoneme i and phoneme j. Assume the currently updated reference angle value is a1, so that the intersection points of the corresponding straight line L1 with the two first functional relationship curves are as shown in fig. 5, and assume the prediction probabilities corresponding to the two intersection points, i.e., the thresholds of the two phonemes under reference angle value a1, are P1 and P2. Taking the product of these two prediction probabilities as the fourth score, and comparing the third score of each voice sample in the second voice sample set against this fourth score, the accumulated correct wake-up count and false wake-up count over all voice samples are obtained, from which the missed wake-up rate and false wake-up rate of reference angle value a1 follow.
Next, assume the reference angle value is updated to a2, so that the intersection points of the corresponding straight line L2 with the two first functional relationship curves are as shown in fig. 5, and the thresholds of the two phonemes under reference angle value a2 are P1' and P2'. The same comparison of third scores against the new fourth score yields the missed wake-up rate and false wake-up rate of reference angle value a2.
And so on, until the reference angle value has been updated to 90 degrees.
Through these multiple iterations, multiple pairs of missed wake-up rates and false wake-up rates are obtained; the coordinate points corresponding to these pairs can then be plotted in another coordinate system and a curve drawn through them, which is the second functional relationship curve. Each coordinate point on the second functional relationship curve thus corresponds, in essence, to one reference angle value.
After the second functional relationship curve is obtained, the reference angle value corresponding to a second target coordinate on the curve can be determined as the target angle threshold, where the second target coordinate is a coordinate at which the missed wake-up rate and the false wake-up rate meet set conditions. Optionally, the set condition may be, for example, selecting the coordinate with the smallest missed wake-up rate or the smallest false wake-up rate as the second target coordinate, or randomly selecting one of several coordinates at which the missed wake-up rate is below one set value and the false wake-up rate is below another. The purpose of this set condition is to strike a trade-off between the missed wake-up rate and the false wake-up rate.
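Selecting the target angle threshold from the sweep results can be sketched the same way as the per-phoneme threshold selection (hypothetical Python; minimizing FA' + FR' is just one possible set condition, not mandated by the patent):

```python
def find_target_angle(angle_rates):
    """angle_rates: (reference angle value, FA', FR') triples collected
    during the sweep. Return the angle minimizing FA' + FR'."""
    return min(angle_rates, key=lambda t: t[1] + t[2])[0]

# Illustrative sweep results over three reference angle values.
target_angle = find_target_angle([(10, 0.4, 0.1), (45, 0.1, 0.1), (80, 0.05, 0.5)])
```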
The voice wake-up unit of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these voice wake-up units can be constructed using commercially available hardware components and configured by the steps taught in the present scheme.
Fig. 6 is a schematic structural diagram of a voice wake-up apparatus according to an embodiment of the present invention, as shown in fig. 6, the apparatus includes: the device comprises a receiving module 11, a determining module 12 and a control module 13.
A receiving module 11, configured to receive the user voice.
A determining module 12, configured to determine, by a wake-up system, a first score corresponding to the user speech on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword that is customized for a target object by a user; and determining a second score for identifying whether the user voice is an awakening voice according to the threshold value corresponding to each phoneme on the first reference decoding path.
A control module 13, configured to wake up the target object if the first score is greater than or equal to the second score.
Optionally, the apparatus further comprises: a path generation module, configured to determine, in response to the setting operation of the first wake-up keyword, a phoneme sequence corresponding to the first wake-up keyword according to the correspondence between words and phonemes described in a dictionary, and to form the first reference decoding path from the phoneme sequence.
Optionally, the determining module 12, in the process of determining the first score, may specifically be configured to: performing framing processing on the user voice to obtain a plurality of audio frames; sequentially inputting the audio frames into an acoustic model, and predicting phoneme probabilities corresponding to the audio frames by the acoustic model; determining, by a decoder, a first score corresponding to the user speech on the first reference decoding path according to a phoneme probability corresponding to each of the plurality of audio frames, wherein the wake-up system includes the acoustic model and the decoder.
Optionally, the determining module 12, in the process of determining the second score, may specifically be configured to: and determining the second score according to the threshold value corresponding to each phoneme on the first reference decoding path and the number of audio frames corresponding to each phoneme in a plurality of audio frames of the user voice.
Optionally, the apparatus further comprises: a phoneme threshold determining module, configured to label a correspondence between an audio frame and a phoneme of each speech sample in the first speech sample set; for any phoneme, generating a positive sample set and a negative sample set corresponding to the any phoneme, wherein the positive sample set comprises audio frames having the corresponding relation with the any phoneme, and the negative sample set comprises audio frames not having the corresponding relation with the any phoneme; traversing each sample in the positive sample set and the negative sample set, and outputting the prediction probability of the currently traversed sample under any phoneme through an acoustic model in the wake-up system; determining a first functional relation curve reflecting the missing recognition rate and the error recognition rate of any phoneme according to the prediction probability corresponding to each sample in the positive sample set and the negative sample set respectively; and determining that the prediction probability corresponding to a first target coordinate on the first functional relation curve is a threshold corresponding to any phoneme, wherein the first target coordinate enables the missing recognition rate and the false recognition rate to accord with set conditions.
Optionally, in the process of determining the first functional relationship curve, the phoneme threshold determining module may be specifically configured to: performing the following iterative process at least once until the reference probability value is updated to the upper value limit: updating the reference probability value; for any sample in the positive sample set and the negative sample set, if the prediction probability corresponding to the sample is greater than the current reference probability value and the sample belongs to the positive sample set, adding one to the number of correctly identified samples in the positive sample; if the prediction probability corresponding to any sample is greater than the current reference probability value and the any sample belongs to a negative sample set, adding one to the number of the incorrectly identified samples in the negative sample; determining the missing recognition rate and the error recognition rate of any phoneme under the current reference probability value according to the accumulated value of the number of correctly recognized samples in the positive samples and the accumulated value of the number of incorrectly recognized samples in the negative samples; and determining the first functional relation curve according to the missing recognition rate and the error recognition rate of any phoneme obtained under each reference probability value.
Optionally, in the process of determining the first target coordinate, the phoneme threshold determining module may be configured to: label each voice sample in the second voice sample set as positive or negative, wherein voice samples containing the second wake-up keyword are labeled positive and voice samples not containing it are labeled negative; determine, through the wake-up system, a third score for each voice sample in the second voice sample set on a second reference decoding path corresponding to the second wake-up keyword; determine the intersection point between a target straight line and the first functional relationship curve of each phoneme, wherein the target straight line passes through the coordinate origin and forms a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relationship curves lie; determine a fourth score for identifying whether a voice sample is wake-up speech according to the threshold currently corresponding to each phoneme on the second reference decoding path, wherein each phoneme's current threshold is the prediction probability corresponding to the intersection point on its first functional relationship curve; for any voice sample in the second voice sample set, determine the false wake-up rate and missed wake-up rate of the target angle threshold according to the wake-up speech recognition result obtained by comparing the sample's third and fourth scores and the sample's positive/negative label; and if the false wake-up rate and missed wake-up rate of the target angle threshold meet set conditions, determine the intersection point between the target straight line and each phoneme's first functional relationship curve as the first target coordinate on the corresponding first functional relationship curve.
Optionally, in the determining the fourth score, the phoneme threshold determining module may be configured to: and for any voice sample in the second voice sample set, determining a fourth score for identifying whether the any voice sample is an awakening voice according to a threshold value respectively corresponding to each current phoneme included in the second reference decoding path and the number of audio frames respectively corresponding to each phoneme in a plurality of audio frames of the any voice sample.
Optionally, in the process of determining the false wake-up rate and the missing wake-up rate of the target angle threshold, the phoneme threshold determining module may be configured to: for any voice sample, if the third fraction of the voice sample is greater than the fourth fraction and the voice sample is marked as a positive sample, adding one to the correct wake-up time; if the third fraction of any voice sample is greater than the fourth fraction and any voice sample is marked as a negative sample, adding one to the number of false awakening times; and determining the missed awakening rate and the false awakening rate of the target angle threshold according to the correct awakening time accumulated value and the false awakening time accumulated value corresponding to all the voice samples in the second voice sample set.
Optionally, in the determining the target angle threshold, the phoneme threshold determining module may be configured to: and performing the following iterative process at least once until the reference angle value is updated to the upper value limit: updating the reference angle value; determining a missed awakening rate and a false awakening rate of the current reference angle value according to a correct awakening time accumulated value and a false awakening time accumulated value corresponding to all voice samples in the second voice sample set under the current reference angle value; determining a second function relation curve reflecting the missed awakening rate and the false awakening rate of each reference angle value according to the obtained missed awakening rate and the false awakening rate of each reference angle value; and determining a second target coordinate meeting set conditions on the second functional relation curve, wherein a reference angle value corresponding to the second target coordinate is used as the target angle threshold.
The apparatus shown in fig. 6 can perform the methods provided in the foregoing embodiments, and details of the portions of this embodiment that are not described in detail can refer to the related descriptions of the foregoing embodiments, which are not described herein again.
In one possible design, the structure of the voice wake-up apparatus shown in fig. 6 may be implemented as an electronic device, which may be a terminal device or a server, and as shown in fig. 7, the electronic device may include: a processor 21 and a memory 22. Wherein the memory 22 has stored thereon executable code, which when executed by the processor 21, makes the processor 21 capable of executing the voice wake-up method as provided in the foregoing embodiments.
In practice, the electronic device may also include a communication interface 23 for communicating with other devices.
In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to perform a voice wake-up method as provided in the foregoing embodiments.
The above-described apparatus embodiments are merely illustrative, wherein the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented with the addition of a necessary general hardware platform, or by a combination of hardware and software. Based on this understanding, the portions of the above technical solutions that in essence contribute to the prior art may be embodied in the form of a computer program product, which may be carried on one or more computer-usable storage media containing computer-usable program code, including without limitation disk storage, CD-ROM, and optical storage.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A voice wake-up method, comprising:
receiving user voice;
determining a first score corresponding to the user voice on a first reference decoding path through a wake-up system, wherein the first reference decoding path is established according to a first wake-up keyword defined by a user aiming at a target object;
determining a second score for identifying whether the user voice is an awakening voice according to the threshold value corresponding to each phoneme on the first reference decoding path;
if the first score is greater than or equal to the second score, the target object is awakened.
2. The method of claim 1, further comprising:
and responding to the setting operation of the first awakening keyword, determining a phoneme sequence corresponding to the first awakening keyword according to the corresponding relation between the words and the phonemes described in the dictionary, and forming the first reference decoding path by the phoneme sequence.
3. The method of claim 1, wherein the determining of the first score comprises:
performing framing processing on the user voice to obtain a plurality of audio frames;
sequentially inputting the audio frames into an acoustic model, and predicting phoneme probabilities corresponding to the audio frames by the acoustic model;
determining, by a decoder, a first score corresponding to the user speech on the first reference decoding path according to a phoneme probability corresponding to each of the plurality of audio frames, wherein the wake-up system includes the acoustic model and the decoder.
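One way to realize the decoder scoring of claim 3 is a forced alignment of frames to the reference phoneme path; the dynamic program below is a sketch (not necessarily the patented decoder) that averages the best alignment's log-probabilities so the score is length-normalized:

```python
import math

def path_score(frame_probs, path):
    """Forced alignment of audio frames to the reference phoneme path.
    Each frame either stays on the current phoneme or advances to the
    next one; returns the best alignment's average log-probability.
    frame_probs: one dict per frame mapping phoneme -> probability."""
    n, m = len(frame_probs), len(path)
    neg_inf = float("-inf")
    dp = [[neg_inf] * m for _ in range(n)]
    dp[0][0] = math.log(frame_probs[0].get(path[0], 1e-10))
    for t in range(1, n):
        for j in range(m):
            stay = dp[t - 1][j]
            advance = dp[t - 1][j - 1] if j > 0 else neg_inf
            dp[t][j] = max(stay, advance) + math.log(frame_probs[t].get(path[j], 1e-10))
    # The alignment must end on the last phoneme of the path.
    return dp[n - 1][m - 1] / n
```

A decoding path whose phonemes match the audio scores higher than a mismatched one, which is what the comparison in claim 1 exploits.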
4. The method of any one of claims 1 to 3, wherein determining the second score comprises:
determining the second score according to the threshold corresponding to each phoneme on the first reference decoding path and the number of audio frames, among the plurality of audio frames of the user voice, corresponding to each phoneme.
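One plausible reading of claim 4 (an assumption, since the claim does not fix the formula) is to accumulate each phoneme's log-threshold once per aligned frame and normalize by the total frame count, making the second score directly comparable to a per-frame-averaged first score:

```python
import math

def second_score(path, thresholds, frame_counts):
    """Hypothetical second-score formula: frame-count-weighted average
    of the per-phoneme log-thresholds along the reference decoding path.
    thresholds: phoneme -> threshold probability (from claim 5);
    frame_counts: phoneme -> number of aligned audio frames."""
    total_frames = sum(frame_counts[p] for p in path)
    weighted = sum(frame_counts[p] * math.log(thresholds[p]) for p in path)
    return weighted / total_frames
```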
5. The method of claim 4, further comprising determining the threshold for a phoneme by:
labeling the correspondence between audio frames and phonemes for the voice samples in a first voice sample set;
for any given phoneme, generating a positive sample set comprising the audio frames that correspond to the phoneme and a negative sample set comprising the audio frames that do not;
traversing each sample in the positive and negative sample sets, and outputting, through the acoustic model in the wake-up system, the predicted probability of the currently traversed sample under the phoneme;
determining, from the predicted probabilities of the samples in the positive and negative sample sets, a first functional relation curve reflecting the miss rate and the false recognition rate of the phoneme; and
determining the predicted probability at a first target coordinate on the first functional relation curve as the threshold for the phoneme, wherein the first target coordinate makes the miss rate and the false recognition rate meet set conditions.
6. The method of claim 5, wherein determining the first functional relation curve comprises performing the following iterative process at least once, until the reference probability value reaches its upper limit:
updating the reference probability value;
for any sample in the positive and negative sample sets, if the predicted probability of the sample is greater than the current reference probability value and the sample belongs to the positive sample set, incrementing the count of correctly recognized positive samples;
if the predicted probability of the sample is greater than the current reference probability value and the sample belongs to the negative sample set, incrementing the count of incorrectly recognized negative samples;
determining the miss rate and the false recognition rate of the phoneme under the current reference probability value according to the accumulated counts of correctly recognized positive samples and incorrectly recognized negative samples; and
determining the first functional relation curve from the miss rate and the false recognition rate obtained under each reference probability value.
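The sweep described in claims 5 and 6 can be sketched as follows; each reference probability value yields one (miss rate, false recognition rate) point of the first functional relation curve (the `steps` granularity is an arbitrary choice here):

```python
def miss_false_curve(pos_probs, neg_probs, steps=10):
    """Sweep a reference probability value from 0 to 1 and record, per
    value, the miss rate (positives not exceeding it) and the false
    recognition rate (negatives exceeding it)."""
    curve = []
    for i in range(steps + 1):
        ref = i / steps
        correct_pos = sum(p > ref for p in pos_probs)
        wrong_neg = sum(p > ref for p in neg_probs)
        curve.append((ref,
                      1 - correct_pos / len(pos_probs),  # miss rate
                      wrong_neg / len(neg_probs)))       # false recognition rate
    return curve
```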
7. The method of claim 5, further comprising determining the first target coordinate by:
labeling each voice sample in a second voice sample set as positive or negative, wherein a voice sample containing a second wake-up keyword is labeled positive and a voice sample not containing it is labeled negative;
determining, through the wake-up system, a third score for each voice sample in the second voice sample set on a second reference decoding path corresponding to the second wake-up keyword;
determining the intersection point between a target straight line and the first functional relation curve of each phoneme, wherein the target straight line passes through the coordinate origin and forms an angle equal to a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relation curve lies;
determining, according to the current threshold of each phoneme contained in the second reference decoding path, a fourth score used to identify whether a voice sample is a wake-up voice, wherein the current threshold of each phoneme is the predicted probability at the intersection point on the corresponding first functional relation curve;
for any voice sample in the second voice sample set, determining the false wake-up rate and the missed wake-up rate of the target angle threshold according to the wake-up recognition result obtained by comparing the third score and the fourth score of the voice sample with the positive/negative label of the voice sample; and
if the false wake-up rate and the missed wake-up rate of the target angle threshold meet set conditions, determining the intersection point between the target straight line and the first functional relation curve of each phoneme as the first target coordinate on the corresponding curve.
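The line/curve intersection of claim 7 can be approximated by picking the curve point nearest to a ray from the origin at the target angle; this nearest-point shortcut is an assumption, since the curve here is a discrete point list rather than a continuous function:

```python
import math

def pick_operating_point(curve, angle_deg):
    """Approximate the intersection of the curve with a ray from the
    origin at angle_deg (measured from the false-recognition axis).
    curve: list of (reference_prob, miss_rate, false_rate) tuples."""
    theta = math.radians(angle_deg)
    dx, dy = math.cos(theta), math.sin(theta)  # unit direction of the ray
    def dist(point):
        # Perpendicular distance from (false_rate, miss_rate) to the ray.
        _, miss, false = point
        return abs(false * dy - miss * dx)
    return min(curve, key=dist)
```

At 45 degrees the ray is the miss-rate-equals-false-rate diagonal, so the chosen point approximates an equal-error operating point.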
8. The method of claim 7, wherein determining the fourth score comprises:
for any voice sample in the second voice sample set, determining the fourth score used to identify whether the voice sample is a wake-up voice according to the current threshold of each phoneme included in the second reference decoding path and the number of audio frames, among the plurality of audio frames of the voice sample, corresponding to each phoneme.
9. The method of claim 7, wherein determining the false wake-up rate and the missed wake-up rate of the target angle threshold comprises:
for any voice sample, if the third score of the voice sample is greater than the fourth score and the voice sample is labeled positive, incrementing the count of correct wake-ups;
if the third score of the voice sample is greater than the fourth score and the voice sample is labeled negative, incrementing the count of false wake-ups; and
determining the missed wake-up rate and the false wake-up rate of the target angle threshold according to the accumulated counts of correct wake-ups and false wake-ups over all voice samples in the second voice sample set.
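The rate bookkeeping of claim 9 can be sketched as below; the comparison uses `>=` to match the wake-up rule of claim 1, whereas claim 9's literal wording uses a strict `>` (a minor ambiguity this sketch resolves one way):

```python
def wake_rates(samples):
    """samples: list of (third_score, fourth_score, is_positive).
    A sample 'wakes' when third_score >= fourth_score; missed wake-ups
    are positives that fail to wake, false wake-ups are negatives that
    wake."""
    positives = [s for s in samples if s[2]]
    negatives = [s for s in samples if not s[2]]
    missed = sum(third < fourth for third, fourth, _ in positives) / len(positives)
    false = sum(third >= fourth for third, fourth, _ in negatives) / len(negatives)
    return missed, false
```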
10. The method of claim 9, further comprising determining the target angle threshold by performing the following iterative process at least once, until the reference angle value reaches its upper limit:
updating the reference angle value;
determining the missed wake-up rate and the false wake-up rate of the current reference angle value according to the accumulated counts of correct wake-ups and false wake-ups over all voice samples in the second voice sample set under the current reference angle value;
determining, from the missed wake-up rate and the false wake-up rate obtained at each reference angle value, a second functional relation curve reflecting them; and
determining, on the second functional relation curve, a second target coordinate that meets set conditions, wherein the reference angle value corresponding to the second target coordinate serves as the target angle threshold.
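The angle sweep of claim 10 mirrors the probability sweep of claim 6; below, the "set condition" is assumed (for illustration only) to be the angle where the missed and false wake-up rates are closest, i.e. an equal-error-rate-style operating point:

```python
def sweep_angles(eval_fn, step_deg=5, max_deg=90):
    """eval_fn(angle) -> (missed_rate, false_rate) for that reference
    angle. Collect the second functional relation curve and return the
    angle where the two rates are closest, plus the curve itself."""
    curve = [(a, *eval_fn(a)) for a in range(0, max_deg + 1, step_deg)]
    best_angle = min(curve, key=lambda pt: abs(pt[1] - pt[2]))[0]
    return best_angle, curve
```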
11. A voice wake-up apparatus, comprising:
a receiving module configured to receive a user voice;
a determining module configured to determine, through a wake-up system, a first score for the user voice on a first reference decoding path, wherein the first reference decoding path is established according to a first wake-up keyword defined by a user for a target object, and to determine, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score used to identify whether the user voice is a wake-up voice; and
a control module configured to wake up the target object if the first score is greater than or equal to the second score.
12. An electronic device, comprising a memory and a processor, wherein the memory stores executable code which, when executed by the processor, causes the processor to perform the voice wake-up method of any one of claims 1 to 10.
CN201910295356.5A 2019-04-12 Voice wakeup method, device and equipment Active CN111862963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910295356.5A CN111862963B (en) 2019-04-12 Voice wakeup method, device and equipment


Publications (2)

Publication Number Publication Date
CN111862963A true CN111862963A (en) 2020-10-30
CN111862963B CN111862963B (en) 2024-05-10

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN113506574A (en) * 2021-09-09 2021-10-15 深圳市友杰智新科技有限公司 Method and device for recognizing user-defined command words and computer equipment


Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
US20150154953A1 (en) * 2013-12-02 2015-06-04 Spansion Llc Generation of wake-up words
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
US20170116994A1 (en) * 2015-10-26 2017-04-27 Le Holdings(Beijing)Co., Ltd. Voice-awaking method, electronic device and storage medium
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
US20180158449A1 (en) * 2016-12-02 2018-06-07 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for waking up via speech based on artificial intelligence
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109509465A (en) * 2017-09-15 2019-03-22 阿里巴巴集团控股有限公司 Processing method, component, equipment and the medium of voice signal
CN109273007A (en) * 2018-10-11 2019-01-25 科大讯飞股份有限公司 Voice awakening method and device


Similar Documents

Publication Publication Date Title
CN110838289B (en) Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
US10943582B2 (en) Method and apparatus of training acoustic feature extracting model, device and computer storage medium
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
CN107767863B (en) Voice awakening method and system and intelligent terminal
CN103971685B (en) Method and system for recognizing voice commands
CN111797632B (en) Information processing method and device and electronic equipment
JP2018536905A (en) Utterance recognition method and apparatus
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
CN110634469B (en) Speech signal processing method and device based on artificial intelligence and storage medium
Alon et al. Contextual speech recognition with difficult negative training examples
CN109036471B (en) Voice endpoint detection method and device
CN111710337B (en) Voice data processing method and device, computer readable medium and electronic equipment
CN112818680B (en) Corpus processing method and device, electronic equipment and computer readable storage medium
CN113450771B (en) Awakening method, model training method and device
CN111554276A (en) Speech recognition method, device, equipment and computer readable storage medium
CN111369981A (en) Dialect region identification method and device, electronic equipment and storage medium
KR20220082790A (en) Method and apparatus for processing audio signal, method and apparatus for training model, electronic device , storage medium and computer program
CN111883121A (en) Awakening method and device and electronic equipment
CN114416943A (en) Training method and device for dialogue model, electronic equipment and storage medium
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
CN111768789A (en) Electronic equipment and method, device and medium for determining identity of voice sender thereof
KR20220128397A (en) Alphanumeric Sequence Biasing for Automatic Speech Recognition
CN111640423B (en) Word boundary estimation method and device and electronic equipment
CN115858776B (en) Variant text classification recognition method, system, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant