CN111862963B - Voice wakeup method, device and equipment - Google Patents


Publication number
CN111862963B
Authority
CN
China
Prior art keywords
wake
voice
phoneme
determining
sample
Prior art date
Legal status
Active
Application number
CN201910295356.5A
Other languages
Chinese (zh)
Other versions
CN111862963A (en)
Inventor
陈梦喆
雷鸣
Current Assignee
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201910295356.5A
Publication of CN111862963A
Application granted
Publication of CN111862963B

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/22 - Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/28 - Constructional details of speech recognition systems
    • G10L2015/223 - Execution procedure of a spoken command


Abstract

The embodiment of the invention provides a voice wake-up method, apparatus, and device, wherein the method comprises: receiving voice output by a user; determining, through a wake-up system, a first score corresponding to the voice on a first reference decoding path, wherein the first reference decoding path is established according to a first wake-up keyword customized by the user for a target object; determining, according to the threshold corresponding to each phoneme on the first reference decoding path, a second score used to identify whether the voice is a wake-up voice; and if the first score is greater than or equal to the second score, waking up the target object. In this scheme, the wake-up system recognizes wake-up keywords, i.e. wake-up voices, at the finer-grained level of phoneme units: whether a sentence of speech is a wake-up voice is judged against phoneme-level thresholds. The wake-up system is therefore more universal and can accurately recognize the various wake-up keywords customized by different users.

Description

Voice wakeup method, device and equipment
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a voice wake-up method, apparatus, and device.
Background
With the continuous development of artificial intelligence, man-machine interaction has become multi-modal, and the voice interaction mode in particular is supported by a wide range of products, where a product may be a device or an application program. Taking a device that supports voice interaction as an example, a common interaction flow at present is: when the user wants to use the device, he first speaks a wake-up voice to wake the device, so that the device switches from the sleep state to the working state, and then carries out normal voice interaction with it.
A wake-up voice is a voice that contains at least the wake-up keyword. For example, assuming the wake-up keyword is "panning", when the user speaks "panning", or a voice containing "panning", it can be regarded as a wake-up voice, and the corresponding device switches from the sleep state to the working state.
Currently, for a given device, the wake-up keyword is usually preset by the developer. To recognize whether the speech uttered by a user is the wake-up speech for a device, a wake-up system (or wake-up model) may be trained and used to recognize the wake-up speech. When the wake-up keyword is fixed, it suffices to collect a large number of voices containing that keyword to train the wake-up system, and the trained system generally performs well on that keyword: it can accurately judge whether an input voice is a wake-up voice containing the keyword. If the wake-up keyword is changed, however, the performance of the wake-up system cannot be guaranteed, because the training sample set may cover the new keyword only slightly or not at all. In practical applications, users want to customize their own wake-up keywords; likewise, if the training sample set of the wake-up system contains no, or only a few, corpus samples containing a user-customized wake-up keyword, the performance of the wake-up system will degrade greatly.
Disclosure of Invention
The embodiment of the invention provides a voice awakening method, device and equipment, which are used for detecting whether an audio stream contains user-defined awakening keyword voice or not in real time.
In a first aspect, an embodiment of the present invention provides a voice wake-up method, where the method includes:
Receiving user voice;
Determining a first score corresponding to the user voice on a first reference decoding path through a wake-up system, wherein the first reference decoding path is established according to a first wake-up keyword customized by a user for a target object;
determining a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path;
and if the first score is greater than or equal to the second score, waking up the target object.
In a second aspect, an embodiment of the present invention provides a voice wake-up device, including:
the receiving module is used for receiving the voice of the user;
The determining module is used for determining a first score corresponding to the user voice on a first reference decoding path through the awakening system, wherein the first reference decoding path is established according to a first awakening keyword customized by a user for a target object; determining a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path;
And the control module is used for waking up the target object if the first score is greater than or equal to the second score.
In a third aspect, an embodiment of the present invention provides an electronic device, including a processor and a memory, where the memory stores executable code, and when the executable code is executed by the processor, causes the processor to at least implement the voice wake-up method in the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to at least implement the voice wake-up method of the first aspect.
In the embodiment of the invention, a user can customize a wake-up keyword for a target object (for example, a certain application program or a certain device). Based on the user's customization operation, the wake-up system establishes a reference decoding path corresponding to the wake-up keyword, composed of the phonemes contained in the keyword in sequence. On this basis, in practical application, after a user speaks a sentence of speech, the wake-up system determines a first score corresponding to the speech on the reference decoding path, determines a second score for recognizing whether the speech is a wake-up speech according to the thresholds corresponding to the phonemes on the reference decoding path, and then determines whether the speech is a wake-up speech by comparing the two scores. Specifically, if the first score is greater than or equal to the second score, the speech is determined to be a wake-up speech, that is, it is determined to contain the wake-up keyword, and the target object is woken up. In this scheme, the wake-up system recognizes wake-up keywords, i.e. wake-up voices, at the finer-grained level of phoneme units: whether a sentence of speech is a wake-up voice is judged against phoneme-level thresholds. The wake-up system is therefore more universal and can accurately recognize the various wake-up keywords customized by different users.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method of waking up speech according to an exemplary embodiment;
FIG. 2 is a flowchart of a method for determining a threshold for phoneme correspondence provided in an exemplary embodiment;
FIG. 3 is a schematic diagram of a phoneme threshold determination process provided by an example embodiment;
FIG. 4 is a flowchart of a method of determining first target coordinates provided by an exemplary embodiment;
FIG. 5 is a schematic diagram of a phoneme threshold determination process provided by an example embodiment;
FIG. 6 is a schematic diagram of a voice wake-up device according to an exemplary embodiment;
fig. 7 is a schematic structural diagram of an electronic device corresponding to the voice wake-up device provided in the embodiment shown in fig. 6.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well. Unless the context clearly indicates otherwise, "plurality" generally means at least two.
The words "if", as used herein, may be interpreted as "at … …" or "at … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrase "if determined" or "if detected (stated condition or event)" may be interpreted as "when determined" or "in response to determination" or "when detected (stated condition or event)" or "in response to detection (stated condition or event), depending on the context.
It should also be noted that the terms "comprises," "comprising," and any variations thereof are intended to cover a non-exclusive inclusion, such that a product or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such product or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a product or system comprising that element.
In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
Fig. 1 is a flowchart of a voice wake-up method according to an exemplary embodiment, as shown in fig. 1, including the following steps:
101. user speech is received.
102. And determining a first score corresponding to the voice of the user on a first reference decoding path through the awakening system, wherein the first reference decoding path is established according to a first awakening keyword customized by the user for the target object.
103. And determining a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path.
104. If the first score is greater than or equal to the second score, the target object is awakened.
The voice wake-up method provided herein can be executed by an electronic device, which may be a terminal device such as a PC or a notebook computer, or a server. The server may be a physical server comprising an independent host, a virtual server carried by a host cluster, or a cloud server.
In the embodiment of the present invention, a user may customize a wake keyword for a target object (for example, may be a certain application program or a certain device), where a wake keyword customized by a certain user is referred to as a first wake keyword. In practical applications, the target object may provide a user interface, so that the user may perform the setting operation of the first wake-up keyword, for example, an input box is provided in a certain interface, so that the setting of the first wake-up keyword is performed through the input box. Therefore, after the user sets the first wake-up keyword, when the subsequent user wants to use the target object, the wake-up voice corresponding to the first wake-up keyword needs to be spoken first, so as to switch the target object from the dormant state to the working state.
Because different users may customize different wake-up keywords for the same target object, and in order to allow each of them to wake up the target object with their own keyword, the embodiment of the invention provides a wake-up system (or wake-up model) capable of accurately recognizing any wake-up keyword, that is, a wake-up system with universality. To achieve this universality, the wake-up system performs recognition of wake-up keywords, that is, wake-up voices, at the level of the finer-grained phoneme unit, so that even if the wake-up system has never learned a certain wake-up keyword as a whole, recognition of that keyword can still be achieved on the basis of its phoneme units. The phoneme units may be single-phoneme units or multi-phoneme units such as triphone units.
Take as an example a user customizing a first wake-up keyword for a target object. First, in response to the user's setting operation on the first wake-up keyword, the wake-up system determines the phoneme sequence corresponding to the first wake-up keyword according to the correspondence between words and phonemes described in a dictionary; this phoneme sequence forms the first reference decoding path corresponding to the first wake-up keyword. When the first wake-up keyword is composed of several words, the phonemes of those words are arranged in order to form the first reference decoding path. The first reference decoding path may be understood as follows: when the user actually speaks the first wake-up keyword, each phoneme on the path should, in theory, be decoded in sequence as the wake-up system decodes the speech.
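The path construction described here can be sketched roughly as follows. This is an illustrative assumption, not the patented implementation: the lexicon entries, the keyword words, and the function name are all invented for the example.

```python
# Hypothetical sketch of building a reference decoding path: the phonemes
# of each word in the custom wake-up keyword are concatenated in order.
def build_reference_path(keyword_words, lexicon):
    path = []
    for word in keyword_words:
        if word not in lexicon:
            raise KeyError(f"word '{word}' missing from lexicon")
        path.extend(lexicon[word])  # append this word's phonemes in order
    return path

# Invented dictionary entries for illustration only.
lexicon = {"hello": ["HH", "AH", "L", "OW"], "cloud": ["K", "L", "AW", "D"]}
path = build_reference_path(["hello", "cloud"], lexicon)
```

A real system would draw on a full pronunciation lexicon, and possibly grapheme-to-phoneme conversion for words absent from the dictionary.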
After the first reference decoding path is set, the user can wake up the target object by the customized first wake-up keyword in the subsequent process of using the target object, and the following description is given to the process of actually using the target object by the user.
In practical application, after a user speaks a sentence of speech, the wake-up system first needs to determine the first score corresponding to the speech on the reference decoding path. Internally, the wake-up system may comprise an acoustic model and a decoder. When determining the first score, the received speech is first split into a plurality of audio frames, where the duration of each frame ensures that it corresponds to only one phoneme; the audio frames are then input into the acoustic model in sequence, which predicts the phoneme probabilities corresponding to each frame; finally, the decoder determines, from these per-frame phoneme probabilities, the first score corresponding to the speech on the first reference decoding path. The first score may be understood as indicating how likely it is that the content contained in the speech is the first wake-up keyword corresponding to the first reference decoding path.
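One common way to score speech along a fixed phoneme path is to sum the log-probabilities the acoustic model assigns to the path phoneme aligned with each frame. The sketch below assumes the frame-to-phoneme alignment is already known; it is an illustrative simplification, not the decoder described in this patent, and all names and numbers are invented.

```python
import math

def first_score(frame_probs, frame_alignment):
    """frame_probs[t] maps phoneme -> acoustic-model probability for frame t;
    frame_alignment[t] is the reference-path phoneme assigned to frame t."""
    return sum(math.log(frame_probs[t][ph])
               for t, ph in enumerate(frame_alignment))

# Toy per-frame probabilities for a two-phoneme path A, B.
probs = [{"A": 0.9, "B": 0.1}, {"A": 0.8, "B": 0.2}, {"A": 0.3, "B": 0.7}]
score = first_score(probs, ["A", "A", "B"])
```

In practice the alignment itself is produced by the decoder; summing log-probabilities rather than multiplying raw probabilities avoids numerical underflow on long utterances.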
In addition to the above determination of the first score, a second score for identifying whether the received speech is a wake-up speech (i.e., determining whether the speech is a speech corresponding to the first wake-up keyword or whether the content included in the speech corresponds to the first wake-up keyword) is determined according to the threshold value corresponding to each phoneme on the first reference decoding path.
It will be appreciated that identifying whether the received speech is the wake-up speech corresponding to the first wake-up keyword (that is, whether the speech, or the content it contains, corresponds to the first wake-up keyword) is in fact a classification problem, and classification problems can generally be solved by thresholding; here, the second score serves as the classification threshold for deciding whether the received speech is the wake-up speech corresponding to the first wake-up keyword. It should be emphasized, however, that in this embodiment the threshold is defined at the phoneme level: the second score is determined by the thresholds corresponding to the phonemes traversed in sequence on the first reference decoding path.
That is, the second score may be regarded as the score a speech must reach in order to be regarded as containing the wake-up keyword. It is determined by the thresholds corresponding to the phonemes on the first reference decoding path, where the threshold of each phoneme may be understood as the score each constituent phoneme is expected to contribute toward reaching the second score.
In an alternative embodiment, the second score may be determined as the sum of the thresholds corresponding to the phonemes on the first reference decoding path. Determined in this way, the second score is independent of the received speech and applies to any received speech.
In another alternative embodiment, the second score may be determined according to the threshold corresponding to each phoneme on the first reference decoding path and the number of audio frames, among the audio frames of the received speech, corresponding to each phoneme. Specifically, the threshold of each phoneme is multiplied by the number of audio frames corresponding to that phoneme, and the products are accumulated to obtain the second score. For example, assume that the first reference decoding path contains, in sequence, phoneme A with a threshold of 0.6 and phoneme B with a threshold of 0.4, and that the received speech comprises audio frames 1 through 4, where frames 1, 2, and 3 each correspond to phoneme A (i.e. each of these three frames contains only phoneme A) and frame 4 corresponds to phoneme B. The second score is then 0.6×3 + 0.4×1 = 2.2. In this alternative embodiment, the threshold for identifying whether the speech is a wake-up speech, namely the second score, is therefore determined both by the individual phonemes that make up the first wake-up keyword and by the duration of each phoneme during decoding (characterized by its number of audio frames); refining the threshold in this way improves the final recognition result.
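The frame-weighted variant can be sketched as follows, reusing the numbers from the worked example above (thresholds 0.6 and 0.4, with three frames of phoneme A and one frame of phoneme B); the function name and data layout are assumptions for illustration.

```python
def second_score(phoneme_thresholds, frame_counts):
    # Multiply each phoneme's threshold by the number of audio frames it
    # spans in the received speech, and accumulate the products.
    return sum(phoneme_thresholds[ph] * n for ph, n in frame_counts.items())

# The worked example: 0.6 * 3 + 0.4 * 1, approximately 2.2.
s = second_score({"A": 0.6, "B": 0.4}, {"A": 3, "B": 1})
```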
The above-mentioned determination process of the threshold value corresponding to the phoneme will be described in the subsequent embodiments.
After the first score and the second score are obtained, it is determined whether the received voice is a wake-up voice by comparing the first score and the second score. Specifically, when the first score is greater than or equal to the second score, determining that the voice is wake-up voice, and then waking up the target object; otherwise, when the first score is smaller than the second score, the voice is determined not to be the awakening voice and is not responded.
In summary, the wake-up system recognizes wake-up keywords or wake-up voices on a phoneme unit with finer dimension, that is, whether a sentence of voices are wake-up voices is judged based on a phoneme level threshold, so that the wake-up system has more universality and can accurately recognize various wake-up keywords customized by different users.
The following describes a determination process of the threshold value corresponding to the phoneme.
Fig. 2 is a flowchart of a method for determining a threshold corresponding to a phoneme according to an exemplary embodiment, where, as shown in fig. 2, the method may include the following steps:
201. Annotate the correspondence between the audio frames and phonemes of each speech sample in the first speech sample set.
For a target object, the first speech sample set may be constructed by collecting a large number of voices output by users during use of the target object.
For each voice sample in the first voice sample set, the text content corresponding to the voice sample can be manually marked in advance, and then the text content corresponding to the voice sample is converted into a phoneme sequence by combining with a dictionary which is generated in advance and describes the corresponding relation between words and phonemes.
In addition, the first speech sample set may be used to determine the threshold value corresponding to each phoneme, on the one hand, and to train the acoustic model contained in the wake-up system, on the other hand.
Regardless of whether the method is used for training the acoustic model or determining the threshold value corresponding to each phoneme, a certain preprocessing can be performed on each voice sample, and the preprocessing process at least comprises framing each voice sample, so that a one-to-one correspondence between each audio frame and each phoneme is obtained.
The process of training the acoustic model based on the first voice sample set is consistent with the training process of the existing acoustic model, and is not repeated. The following only describes how the determination of the threshold value for each phoneme is made based on the first set of speech samples and the already trained acoustic model.
202. For any phoneme, generate a positive sample set and a negative sample set corresponding to that phoneme, wherein the positive sample set comprises all audio frames having a correspondence with the phoneme, and the negative sample set comprises all audio frames having no correspondence with the phoneme.
By labeling each voice sample in the first voice sample set, a plurality of phonemes contained in the first voice sample set can be obtained, and the corresponding threshold value determination processing is sequentially performed for each phoneme.
First, consider any phoneme, denoted phoneme i. From the audio frames into which the speech samples in the first speech sample set are divided, the audio frames corresponding to phoneme i are selected to form the positive sample set of phoneme i, and the remaining audio frames, which do not correspond to phoneme i, form its negative sample set.
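Splitting the labeled audio frames into positive and negative sample sets for a given phoneme is straightforward. The sketch below assumes the frames already carry their annotated phoneme labels; the data layout and names are invented for illustration.

```python
def split_samples(labeled_frames, phoneme):
    """labeled_frames: (frame, phoneme_label) pairs from the annotated set."""
    positive = [f for f, ph in labeled_frames if ph == phoneme]
    negative = [f for f, ph in labeled_frames if ph != phoneme]
    return positive, negative

# Toy annotated frames: two belong to phoneme A, two to other phonemes.
frames = [("f0", "A"), ("f1", "A"), ("f2", "B"), ("f3", "C")]
pos, neg = split_samples(frames, "A")
```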
203. Traverse each sample in the positive and negative sample sets, and output, through the acoustic model in the wake-up system, the prediction probability of the currently traversed sample under the phoneme.
Assuming that the currently traversed sample is sample j, the sample j is input into an acoustic model, and the acoustic model predicts the probability that the sample j corresponds to each phoneme. Here, the prediction probability of the sample j under the phoneme i output by the acoustic model is extracted.
Thus, the prediction probability of each sample in the positive sample set or the negative sample set under the phoneme i can be obtained through the acoustic model. Further, the prediction probability under phoneme i can be marked for each sample in the positive and negative sample sets.
204. Determine, according to the prediction probabilities corresponding to the samples in the positive and negative sample sets, a first functional relation curve reflecting the miss recognition rate and the false recognition rate of the phoneme.
205. Determine the prediction probability corresponding to a first target coordinate on the first functional relation curve as the threshold corresponding to the phoneme, where the first target coordinate is one at which the miss recognition rate and the false recognition rate meet a set condition.
After the prediction probabilities of the samples in the positive and negative sample sets under phoneme i are obtained, two indices corresponding to phoneme i, namely the miss recognition rate and the false recognition rate, can be calculated from them.
Here, the miss recognition rate is denoted FR and the false recognition rate is denoted FA:
FR = 1 - (number of correctly identified samples in the positive sample set / total number of samples in the positive sample set)
FA = number of erroneously identified samples in the negative sample set / total number of samples in the negative sample set
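At a single reference probability value, the two rates can be computed directly from the per-sample prediction probabilities. A minimal sketch, under the assumption that a sample counts as "recognized" as the phoneme when its predicted probability exceeds the reference value:

```python
def fr_fa(pos_probs, neg_probs, ref_value):
    correct_pos = sum(1 for p in pos_probs if p > ref_value)  # true accepts
    wrong_neg = sum(1 for p in neg_probs if p > ref_value)    # false accepts
    fr = 1 - correct_pos / len(pos_probs)  # miss recognition rate
    fa = wrong_neg / len(neg_probs)        # false recognition rate
    return fr, fa

# Toy probabilities: three positive samples, two negative samples.
fr, fa = fr_fa([0.9, 0.8, 0.3], [0.2, 0.7], ref_value=0.5)
```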
After the two indices are calculated, a first functional relation curve reflecting them can be drawn in a set coordinate system, with FA on the horizontal axis and FR on the vertical axis, or vice versa.
It should be noted that the miss recognition rate and false recognition rate calculated for phoneme i should be understood as the miss recognition rate and false recognition rate of phoneme i under different prediction probabilities, where the different prediction probabilities are those output by the acoustic model under phoneme i for the samples in the positive and negative sample sets.
One implementation of determining the first functional relationship is given below:
Perform the following iterative process at least once, until the reference probability value has been updated to its upper limit:
Update the reference probability value;
For any sample in the positive or negative sample set, if the prediction probability corresponding to the sample is greater than the current reference probability value and the sample belongs to the positive sample set, add one to the count of correctly identified positive samples;
if the prediction probability corresponding to the sample is greater than the current reference probability value and the sample belongs to the negative sample set, add one to the count of erroneously identified negative samples;
Determine the miss recognition rate and false recognition rate of phoneme i under the current reference probability value from the accumulated count of correctly identified positive samples and the accumulated count of erroneously identified negative samples;
Determine the first functional relation curve from the miss recognition rate and false recognition rate of phoneme i obtained under each reference probability value.
The reference probability value can be understood as a parameter variable whose values cover the prediction probability range of 0 to 1.
In an alternative embodiment, the prediction probabilities corresponding to the samples in the positive and negative sample sets may be sorted in ascending or descending order, and the reference probability value is then updated by traversing the sorted sequence, that is, the reference probability value is set, in turn, to each prediction probability in the sequence.
In another alternative embodiment, the reference probability value may be updated sequentially from an initial value according to a set step size. For example, starting from 0 with a step size of 0.1, the reference probability value is updated to 0.1 the second time, to 0.2 the third time, and so on until it reaches 1.
The iterative process described above is illustrated by an example. Assume the initial reference probability value is a and the update step size is 0.1. For any sample in the positive or negative sample set, say sample j: if the prediction probability corresponding to sample j is greater than a and sample j belongs to the positive sample set, the count of correctly identified positive samples is increased by one, indicating that sample j is a positive sample and has indeed been identified as positive; if the prediction probability corresponding to sample j is greater than a but sample j belongs to the negative sample set, the count of erroneously identified negative samples is increased by one, indicating that sample j is a negative sample erroneously identified as positive. After this judgment has been made for all samples in both sets, the accumulated count of correctly identified positive samples (say C1) and the accumulated count of erroneously identified negative samples (say C2) are obtained. Then, for the reference probability value a, one pair of FR and FA values can be calculated: FR = 1 - C1/Z1 and FA = C2/Z2, where Z1 and Z2 denote the total numbers of samples in the positive and negative sample sets, respectively.
The reference probability value is then updated to a + 0.1, and the above determination is repeated to obtain the pair of FA and FR values at a + 0.1. The iteration continues in this way until the reference probability value is updated to its upper limit, denoted b, at which point the pair of FA and FR values at b is obtained.
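The sweep just described can be sketched in code. This is a minimal illustration with made-up prediction probabilities; the function name, the data layout, and the fixed sweep range from 0 to 1 are assumptions, not the patent's implementation:

```python
# For one phoneme, sweep a reference probability value a from 0 to 1 in fixed
# steps and compute a (FA, FR) pair at each value:
#   FR = 1 - C1/Z1  (missed recognition rate over the positive sample set)
#   FA = C2/Z2      (false recognition rate over the negative sample set)

def sweep_reference_values(pos_probs, neg_probs, step=0.1):
    z1, z2 = len(pos_probs), len(neg_probs)
    curve = []  # entries of the form (reference value a, FA, FR)
    steps = int(round(1.0 / step))
    for k in range(steps + 1):
        a = k * step
        c1 = sum(1 for p in pos_probs if p > a)  # positives recognized as positive
        c2 = sum(1 for p in neg_probs if p > a)  # negatives misrecognized as positive
        curve.append((round(a, 10), c2 / z2, 1 - c1 / z1))
    return curve

# Hypothetical per-frame prediction probabilities for one phoneme:
curve = sweep_reference_values([0.9, 0.8, 0.3], [0.2, 0.7])
```

Each entry of the returned list corresponds to one candidate threshold, i.e. one point of the first functional relation curve.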
After the pairs of FA and FR values corresponding to each reference probability value from a to b are obtained, the coordinate points corresponding to these pairs can be plotted in a coordinate system and connected into a curve, namely the first functional relation curve. It can thus be appreciated that each coordinate point on the first functional relation curve actually corresponds to a certain reference probability value, i.e. a certain prediction probability.
After the first functional relation curve is obtained, the prediction probability corresponding to a first target coordinate on the first functional relation curve can be determined as the threshold corresponding to phoneme i, where the first target coordinate is a point at which the missed recognition rate and the false recognition rate satisfy a set condition.
Optionally, the set condition may be, for example, selecting the coordinate with the smallest FA or the smallest FR as the first target coordinate, or randomly selecting, as the first target coordinate, one of the coordinates at which FA is smaller than one set value and FR is smaller than another set value. The purpose of the set condition is to strike a balance between the missed recognition rate and the false recognition rate.
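As a hedged sketch of one such set condition: the cutoff values of 0.05, the storage of the curve as (FA, FR, prediction probability) triples, and the smallest-FA fallback rule below are all invented for illustration.

```python
# Pick the first curve point whose FA and FR are both below set values; if no
# point qualifies, fall back to the point with the smallest FA.

def pick_first_target(curve, fa_max=0.05, fr_max=0.05):
    for fa, fr, prob in curve:
        if fa < fa_max and fr < fr_max:
            return (fa, fr, prob)
    return min(curve, key=lambda point: point[0])  # smallest-FA fallback

# Hypothetical curve points for one phoneme:
target = pick_first_target([(0.20, 0.01, 0.3), (0.04, 0.03, 0.6), (0.01, 0.20, 0.9)])
```

The third element of the chosen point is then the phoneme's threshold.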
In the above embodiment, the process of determining the corresponding threshold is described by taking an arbitrary phoneme i as an example. The same process is performed for the other phonemes, and the first functional relation curve corresponding to each phoneme may be drawn in the coordinate system, so that the threshold corresponding to each phoneme is obtained.
As shown in fig. 3, fig. 3 illustrates the first functional relation curves corresponding to two phonemes i and j, and illustrates the first target coordinates corresponding to each phoneme, which are (FA1, FR1) and (FA2, FR2), respectively. Assume that the prediction probabilities corresponding to these two coordinates are P1 and P2; the thresholds for phonemes i and j are thus P1 and P2, respectively. In addition, assuming that the aforementioned first reference decoding path contains phoneme i followed by phoneme j, in an alternative embodiment the aforementioned second score is P1 * P2.
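A minimal sketch of this alternative, assuming the path's per-phoneme thresholds are simply multiplied (the threshold values are made up):

```python
# Combine the thresholds of the phonemes on the reference decoding path
# into a single score by taking their product.
from functools import reduce
from operator import mul

def path_score(thresholds):
    return reduce(mul, thresholds, 1.0)

P1, P2 = 0.6, 0.7           # hypothetical thresholds for phonemes i and j
score = path_score([P1, P2])  # P1 * P2
```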
For the first functional relation curve corresponding to any phoneme i, several ways of determining the first target coordinate on the curve are described in the foregoing embodiments; another way of determining the first target coordinate is described below.
FIG. 4 is a flowchart of a method for determining first target coordinates according to an exemplary embodiment, as shown in FIG. 4, the method may include the steps of:
401. Mark each voice sample in the second voice sample set as a positive or negative sample, where a voice sample containing the second wake-up keyword is marked as a positive sample and a voice sample not containing the second wake-up keyword is marked as a negative sample.
402. Determine, through the wake-up system, the third score of each voice sample in the second voice sample set on the second reference decoding path corresponding to the second wake-up keyword.
403. Determine the intersection point between a target straight line and the first functional relation curve of each phoneme, where the target straight line passes through the coordinate origin and forms a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relation curves are located.
404. Determine, according to the threshold currently corresponding to each phoneme contained on the second reference decoding path, a fourth score for identifying whether a voice sample is wake-up voice, where the threshold currently corresponding to each phoneme is the prediction probability corresponding to the intersection point on the corresponding first functional relation curve.
405. For any voice sample in the second voice sample set, determine the missed wake-up rate and the false wake-up rate of the target angle threshold according to the wake-up voice recognition result obtained by comparing the third score of that voice sample with the fourth score, together with the positive/negative sample mark information of that voice sample.
406. If the missed wake-up rate and the false wake-up rate of the target angle threshold satisfy the set condition, determine the intersection point between the target straight line and the first functional relation curve of each phoneme as the first target coordinate on the corresponding first functional relation curve.
In this embodiment, the second speech sample set is used to assist in determining the threshold corresponding to each phoneme (i.e. in determining the first target coordinate on the first functional relation curve corresponding to each phoneme), and can also be used to test the performance of the wake-up system. Since training on the first speech sample set in the embodiment shown in fig. 2 has been completed, the acoustic model in the wake-up system is already trained, and it can now be checked whether the training result is reliable.
For the target object, a tester can customize a wake-up keyword, called the second wake-up keyword, collect voices containing the second wake-up keyword as positive voice samples and other voices not containing it as negative voice samples, and thereby form the second voice sample set.
For example, assume that the second wake-up keyword is: genius baby. Then voices containing the four consecutive characters "genius baby", such as "genius baby", "turn on genius baby", and "genius baby hello", are all regarded as positive voice samples, and the other voice samples are negative voice samples.
As described in the foregoing embodiment, in response to the setting operation of the second wake-up keyword by the tester, the wake-up system may generate a second reference decoding path corresponding to the second wake-up keyword, where the second reference decoding path is formed by a phoneme sequence included in the second wake-up keyword.
After the second voice sample set is obtained, each voice sample in it is input into the wake-up system in turn, and the wake-up system determines the third score of each voice sample on the second reference decoding path corresponding to the second wake-up keyword. The third score is determined in the same way as the first score in the previous embodiment, which is not repeated here.
In this embodiment, the core idea of searching for the first target coordinate of each phoneme on that phoneme's first functional relation curve is as follows. Suppose a target angle threshold can be determined in the coordinate system in which the first functional relation curves are located; then a target straight line that passes through the coordinate origin and forms this target angle threshold with a preset coordinate axis, such as the horizontal axis, has an intersection point with each first functional relation curve. If the intersection point on each first functional relation curve is taken as the first target coordinate of the corresponding phoneme, the prediction probabilities corresponding to these intersection points are the thresholds of the corresponding phonemes. From the threshold of each phoneme, a desired score (namely, the fourth score below) is determined and used as the basis for identifying whether each voice sample in the second voice sample set corresponds to wake-up voice, i.e. contains the second wake-up keyword. If the recognition result finally obtained based on this desired score performs well, each intersection point can be regarded as the first target coordinate on its first functional relation curve. The quality of the recognition result is judged by two indexes: the missed wake-up rate and the false wake-up rate. Denoting the missed wake-up rate as FR' and the false wake-up rate as FA', there are:
FR' = 1 - (number of correctly awakened positive voice samples) / (total number of positive voice samples);
FA' = (number of erroneously awakened negative voice samples) / (total number of negative voice samples).
Assuming that the second speech sample set includes N1 positive speech samples and N2 negative speech samples, the total number of positive speech samples is N1 and the total number of negative speech samples is N2.
Therefore, to obtain the first target coordinate on each first functional relation curve, the target angle threshold at which the missed wake-up rate and the false wake-up rate meet the set requirements must first be found.
Assume first that the target angle threshold has been found. It will be appreciated that once the target angle threshold is determined, the threshold corresponding to each phoneme is also determined, and the fourth score for identifying whether a speech sample is wake-up speech can then be computed from the threshold currently corresponding to each phoneme. The threshold currently corresponding to each phoneme is the prediction probability corresponding to the intersection point of the target straight line and the corresponding first functional relation curve.
In an alternative embodiment, the fourth score may be determined as the product of the thresholds currently corresponding to the phonemes on the second reference decoding path. In this case the fourth score is independent of the voice samples and applies to every voice sample in the second voice sample set.
In another alternative embodiment, for any voice sample in the second voice sample set, the fourth score for identifying whether that voice sample is wake-up voice may be determined according to the threshold currently corresponding to each phoneme contained on the second reference decoding path and the number of audio frames, among the audio frames of that voice sample, corresponding to each phoneme. In this case, the fourth score may differ from sample to sample.
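The text does not spell out how the thresholds and frame counts are combined; one plausible reading, shown purely as an assumption below, contributes one threshold factor per aligned frame:

```python
# thresholds[k]: current threshold of the k-th phoneme on the second reference
# decoding path; frame_counts[k]: how many frames of this voice sample align
# to that phoneme. The resulting fourth score varies from sample to sample.

def fourth_score(thresholds, frame_counts):
    score = 1.0
    for t, n in zip(thresholds, frame_counts):
        score *= t ** n  # one factor per aligned frame (assumed combination rule)
    return score

# Hypothetical values: two phonemes, aligned to 3 and 2 frames respectively.
s = fourth_score([0.6, 0.7], [3, 2])
```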
In the following, it is first assumed that the target angle threshold has been found, and how to determine its missed wake-up rate and false wake-up rate is described.
For any voice sample in the second voice sample set, say voice sample k: if the third score of voice sample k is greater than the fourth score and voice sample k is marked as a positive sample, the number of correct wake-ups is increased by one; if the third score of voice sample k is greater than the fourth score and voice sample k is marked as a negative sample, the number of false wake-ups is increased by one. The missed wake-up rate and the false wake-up rate corresponding to the target angle threshold can then be determined from the accumulated numbers of correct wake-ups and false wake-ups over all voice samples in the second voice sample set.
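The counting just described can be sketched as follows (the sample data and the pair-list layout are hypothetical):

```python
# samples: (third_score, is_positive) pairs over the second voice sample set;
# a sample is treated as woken up when its third score exceeds the fourth score.

def wake_rates(samples, fourth_score):
    correct = wrong = positives = negatives = 0
    for third, is_positive in samples:
        if is_positive:
            positives += 1
            if third > fourth_score:
                correct += 1  # correct wake-up
        else:
            negatives += 1
            if third > fourth_score:
                wrong += 1    # false wake-up
    missed_rate = 1 - correct / positives  # FR'
    false_rate = wrong / negatives         # FA'
    return missed_rate, false_rate

fr, fa = wake_rates([(0.9, True), (0.4, True), (0.8, False), (0.1, False)], 0.5)
```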
Next, how to find the target angle threshold will be described.
In determining the target angle threshold, the following iterative process is performed at least once, until the reference angle value has been updated to its upper limit:
updating the reference angle value;
determining the missed wake-up rate and the false wake-up rate of the current reference angle value according to the accumulated numbers of correct wake-ups and false wake-ups over all voice samples in the second voice sample set under the current reference angle value;
determining, from the missed wake-up rate and the false wake-up rate obtained at each reference angle value, a second functional relation curve reflecting the missed wake-up rate and the false wake-up rate at each reference angle value;
and determining a second target coordinate on the second functional relation curve that satisfies the set condition, where the reference angle value corresponding to the second target coordinate serves as the target angle threshold.
Therefore, in the process of determining the target angle threshold, multiple iterations are needed to obtain the missed wake-up rate and the false wake-up rate under different reference angle values, so that the second functional relation curve can be drawn from the resulting pairs of missed wake-up rates and false wake-up rates. It can be seen that the second functional relation curve is determined in a manner similar to the first functional relation curve.
Specifically, take as the reference angle value the included angle between the horizontal axis and a straight line passing through the coordinate origin; the reference angle value ranges from 0 to 90 degrees. The iterative process may then be: starting at 0 degrees, update the reference angle value successively in steps of, for example, 5 or 10 degrees, until the reference angle value is updated to 90 degrees, and find the target angle threshold among the traversed reference angle values.
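The per-angle step of finding where the target straight line FR = FA * tan(angle) crosses a phoneme's first functional relation curve can be sketched with linear interpolation; storing the curve as sampled (FA, FR, prediction probability) triples is an assumption for illustration:

```python
import math

# curve: (FA, FR, prediction probability) triples sampled along one phoneme's
# first functional relation curve; angle_deg: the current reference angle value
# measured against the FA (horizontal) axis.

def intersect_threshold(curve, angle_deg):
    slope = math.tan(math.radians(angle_deg))
    prev = None
    for fa, fr, prob in curve:
        val = fr - slope * fa  # sign tells which side of the target line we are on
        if prev is not None and min(prev[0], val) <= 0 <= max(prev[0], val):
            pv, pp = prev
            t = 0.0 if val == pv else pv / (pv - val)  # fraction to the crossing
            return pp + t * (prob - pp)  # interpolated prediction probability
        prev = (val, prob)
    return curve[-1][2]  # no crossing in the sampled range: use the last point

# Hypothetical sampled curve; at 45 degrees the crossing sits at probability 0.5.
threshold = intersect_threshold([(1.0, 0.0, 0.0), (0.5, 0.5, 0.5), (0.0, 1.0, 1.0)], 45)
```

Repeating this for every phoneme's curve at each swept angle yields the per-angle thresholds from which the fourth score and the wake-up rates are computed.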
A specific implementation of the above iterative process is schematically described below with reference to fig. 5.
Fig. 5 illustrates the first functional relation curves corresponding to the phonemes i and j. Assume that the currently updated reference angle value is a1, so that the intersection points of the corresponding straight line L1 with the two first functional relation curves are as shown in fig. 5, and assume that the prediction probabilities corresponding to these two intersection points, i.e. the thresholds of the two phonemes under the reference angle value a1, are P1 and P2. Taking the product of the two prediction probabilities as the fourth score, and comparing the third score and the fourth score for each voice sample in the second voice sample set, the accumulated numbers of correct wake-ups and false wake-ups over all voice samples are obtained, from which the missed wake-up rate and the false wake-up rate of the reference angle value a1 follow.
Next, assume that the reference angle value is updated to a2, so that the intersection points of the corresponding straight line L2 with the two first functional relation curves are as shown in fig. 5, and assume that the thresholds of the two phonemes under the reference angle value a2 are P1' and P2'. In the same way, the missed wake-up rate and the false wake-up rate of the reference angle value a2 are obtained.
And so on until the reference angle has been updated to 90 degrees.
Through these iterations, multiple pairs of missed wake-up rates and false wake-up rates are obtained; the coordinate points corresponding to these pairs can be plotted in another coordinate system and connected into a curve, namely the second functional relation curve. It can thus be appreciated that each coordinate point on the second functional relation curve actually corresponds to a certain reference angle value.
After the second functional relation curve is obtained, the reference angle value corresponding to a second target coordinate on the second functional relation curve can be determined as the target angle threshold, where the second target coordinate is a point at which the missed wake-up rate and the false wake-up rate satisfy the set condition. Optionally, the set condition may be, for example, selecting the coordinate with the minimum missed wake-up rate or the minimum false wake-up rate as the second target coordinate, or randomly selecting, as the second target coordinate, one of the coordinates at which the missed wake-up rate is smaller than one set value and the false wake-up rate is smaller than another set value. The purpose of the set condition is to strike a balance between the missed wake-up rate and the false wake-up rate.
The voice wake-up apparatus of one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these voice wake-up apparatuses can each be constructed from commercially available hardware components configured through the steps taught in the present solution.
Fig. 6 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present invention, as shown in fig. 6, where the device includes: a receiving module 11, a determining module 12 and a control module 13.
The receiving module 11 is configured to receive a user voice.
A determining module 12, configured to determine, by using a wake-up system, a first score corresponding to the user speech on a first reference decoding path, where the first reference decoding path is established according to a first wake-up keyword customized by a user for a target object; and determining a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path.
A control module 13, configured to wake up the target object if the first score is greater than or equal to the second score.
Optionally, the apparatus further comprises: a path generation module, configured to determine, in response to the setting operation of the first wake-up keyword, a phoneme sequence corresponding to the first wake-up keyword according to the correspondence between words and phonemes described in the dictionary, and to form the first reference decoding path from the phoneme sequence.
Optionally, the determining module 12 may be specifically configured to, in determining the first score: framing the user voice to obtain a plurality of audio frames; sequentially inputting the plurality of audio frames into an acoustic model, and predicting phoneme probabilities corresponding to the plurality of audio frames by the acoustic model; determining, by a decoder, a first score corresponding to the user speech on the first reference decoding path according to phoneme probabilities corresponding to each of the plurality of audio frames, wherein the wake-up system includes the acoustic model and the decoder.
Optionally, the determining module 12 may be specifically configured to, in determining the second score: and determining the second score according to the threshold value corresponding to each phoneme on the first reference decoding path and the number of audio frames corresponding to each phoneme in a plurality of audio frames of the user voice.
Optionally, the apparatus further comprises: the phoneme threshold determining module is used for marking the corresponding relation between the audio frames of each voice sample in the first voice sample set and the phonemes; for any phoneme, generating a positive sample set and a negative sample set corresponding to the any phoneme, wherein the positive sample set comprises audio frames with the corresponding relation with the any phoneme, and the negative sample set comprises audio frames without the corresponding relation with the any phoneme; traversing each sample in the positive sample set and the negative sample set, and outputting the prediction probability of the currently traversed sample under any phoneme through an acoustic model in the wake-up system; determining a first function relation curve reflecting the recognition omission factor and the false recognition factor of any phoneme according to the prediction probabilities respectively corresponding to the positive sample set and the negative sample set; and determining that the prediction probability corresponding to a first target coordinate on the first functional relation curve is a threshold value corresponding to any phoneme, wherein the first target coordinate enables the recognition omission factor and the false recognition factor to meet a set condition.
Wherein, optionally, in determining the first functional relation, the phoneme threshold determining module may specifically be configured to: the following iterative process is performed at least once until the reference probability value has been updated to the upper value limit: updating the reference probability value; for any sample in the positive sample set and the negative sample set, if the prediction probability corresponding to the any sample is greater than the current reference probability value and the any sample belongs to the positive sample set, adding one to the number of correctly identified samples in the positive sample; if the prediction probability corresponding to any sample is greater than the current reference probability value and the any sample belongs to a negative sample set, adding one to the number of samples which are erroneously identified in the negative sample; determining the missing recognition rate and the false recognition rate of any phoneme under the current reference probability value according to the accumulated value of the number of correctly recognized samples in the positive samples and the accumulated value of the number of incorrectly recognized samples in the negative samples; and determining the first functional relation curve according to the miss recognition rate and the false recognition rate of any phoneme obtained under each reference probability value.
Optionally, in determining the first target coordinates, the phoneme threshold determining module may be configured to: carry out positive and negative sample marking on each voice sample in the second voice sample set, wherein a voice sample containing the second wake-up keyword is marked as a positive sample, and a voice sample not containing the second wake-up keyword is marked as a negative sample; determine, by the wake-up system, respective third scores of the voice samples in the second voice sample set on the second reference decoding path corresponding to the second wake-up keyword; determine an intersection point between a target straight line and the first functional relation curve of each phoneme, wherein the target straight line passes through the coordinate origin and forms a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relation curves are located; determine, according to the threshold currently corresponding to each phoneme contained on the second reference decoding path, a fourth score for identifying whether a voice sample is wake-up voice, wherein the threshold currently corresponding to each phoneme is the prediction probability corresponding to the intersection point on the corresponding first functional relation curve; for any voice sample in the second voice sample set, determine the missed wake-up rate and the false wake-up rate of the target angle threshold according to a wake-up voice recognition result obtained by comparing the third score of that voice sample with the fourth score, together with the positive/negative sample mark information of that voice sample; and, if the missed wake-up rate and the false wake-up rate of the target angle threshold satisfy the set condition, determine the intersection point between the target straight line and the first functional relation curve of each phoneme as the first target coordinate on the corresponding first functional relation curve.
Optionally, in determining the fourth score, the phoneme threshold determining module may be configured to: and for any voice sample in the second voice sample set, determining whether the any voice sample is a fourth score of wake-up voice according to a threshold value respectively corresponding to each phoneme contained on the second reference decoding path at present and the number of audio frames respectively corresponding to each phoneme in a plurality of audio frames of the any voice sample.
Optionally, in determining the missed wake-up rate and the false wake-up rate of the target angle threshold, the phoneme threshold determining module may be configured to: for any voice sample, if the third score of that voice sample is greater than the fourth score and that voice sample is marked as a positive sample, add one to the number of correct wake-ups; if the third score of that voice sample is greater than the fourth score and that voice sample is marked as a negative sample, add one to the number of false wake-ups; and determine the missed wake-up rate and the false wake-up rate of the target angle threshold according to the accumulated numbers of correct wake-ups and false wake-ups over all voice samples in the second voice sample set.
Optionally, in determining the target angle threshold, the phoneme threshold determining module may be configured to: the following iterative process is performed at least once until the reference angle value has been updated to the upper value limit: updating the reference angle value; determining the missed wake-up rate and the false wake-up rate of the current reference angle value according to the correct wake-up times accumulated value and the false wake-up times accumulated value corresponding to all the voice samples in the second voice sample set under the current reference angle value; determining a second function relation curve reflecting the missed wake-up rate and the false wake-up rate of each reference angle value according to the obtained missed wake-up rate and the false wake-up rate of each reference angle value; and determining a second target coordinate meeting a set condition on the second functional relation curve, wherein a reference angle value corresponding to the second target coordinate is used as the target angle threshold value.
The apparatus shown in fig. 6 may perform the method provided in the foregoing embodiments, and for those portions of this embodiment that are not described in detail, reference may be made to the description related to the foregoing embodiments, which are not repeated here.
In one possible design, the structure of the voice wake apparatus shown in fig. 6 may be implemented as an electronic device, where the electronic device may be a terminal device or a server, as shown in fig. 7, and the electronic device may include: a processor 21, and a memory 22. Wherein the memory 22 has stored thereon executable code which, when executed by the processor 21, causes the processor 21 to perform the voice wake-up method as provided in the previous embodiments.
In practice, the electronic device may also include a communication interface 23 for communicating with other devices.
Additionally, embodiments of the present invention provide a non-transitory machine-readable storage medium having stored thereon executable code that, when executed by a processor of an electronic device, causes the processor to perform a voice wake-up method as provided in the previous embodiments.
The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on such understanding, the foregoing technical solutions, in essence or in the portions contributing to the prior art, may be embodied in the form of a computer program product, which may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A voice wakeup method, comprising:
Receiving user voice;
Determining a first score corresponding to the user voice on a first reference decoding path through a wake-up system, wherein the first reference decoding path is established according to a first wake-up keyword customized by a user for a target object;
determining a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path;
if the first score is greater than or equal to the second score, waking up the target object;
The step of determining the threshold value corresponding to the phonemes comprises the following steps:
Labeling the corresponding relation between the audio frame and the phonemes of each voice sample in the first voice sample set;
For any phoneme, generating a positive sample set and a negative sample set corresponding to the any phoneme, wherein the positive sample set comprises audio frames with the corresponding relation with the any phoneme, and the negative sample set comprises audio frames without the corresponding relation with the any phoneme;
Traversing each sample in the positive sample set and the negative sample set, and outputting the prediction probability of the currently traversed sample under any phoneme through an acoustic model in the wake-up system;
determining a first function relation curve reflecting the recognition omission factor and the false recognition factor of any phoneme according to the prediction probabilities respectively corresponding to the positive sample set and the negative sample set;
And determining that the prediction probability corresponding to a first target coordinate on the first functional relation curve is a threshold value corresponding to any phoneme, wherein the first target coordinate enables the recognition omission factor and the false recognition factor to meet a set condition.
2. The method of claim 1, further comprising:
in response to a setting operation on the first wake-up keyword, determining a phoneme sequence corresponding to the first wake-up keyword according to the correspondence between words and phonemes described in a dictionary, and forming the first reference decoding path from the phoneme sequence.
3. The method of claim 1, wherein the determining of the first score comprises:
framing the user voice to obtain a plurality of audio frames;
sequentially inputting the plurality of audio frames into an acoustic model, and predicting, by the acoustic model, the phoneme probabilities corresponding to the plurality of audio frames;
determining, by a decoder, the first score corresponding to the user voice on the first reference decoding path according to the phoneme probabilities corresponding to each of the plurality of audio frames, wherein the wake-up system comprises the acoustic model and the decoder.
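A hedged sketch of the scoring step in claim 3. A real decoder would compute the frame-to-phoneme alignment itself; here the alignment is taken as given, so the first score reduces to summing log-probabilities along the reference path:

```python
import math

# frame_probs[t][ph]: acoustic-model probability of phoneme ph at frame t
# alignment[t]: the path phoneme that frame t is aligned to (assumed given)
def first_score(frame_probs, alignment):
    return sum(math.log(frame_probs[t][ph]) for t, ph in enumerate(alignment))
```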
4. The method according to any one of claims 1 to 3, wherein the step of determining the second score comprises:
determining the second score according to the threshold value corresponding to each phoneme on the first reference decoding path and the number of audio frames corresponding to each phoneme among the plurality of audio frames of the user voice.
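Claim 4 does not fix the aggregation formula; one plausible reading (an assumption, not the patent's stated method) is that, if the first score sums per-frame log-probabilities, the comparable second score sums log(threshold) for each path phoneme weighted by how many user-voice frames aligned to it:

```python
import math

# thresholds[i]: probability threshold of the i-th path phoneme
# frame_counts[i]: number of user-voice frames aligned to that phoneme
def second_score(thresholds, frame_counts):
    return sum(n * math.log(t) for t, n in zip(thresholds, frame_counts))
```

Under this reading, comparing the first score against the second score is equivalent to asking whether every phoneme was, on average, recognized above its per-phoneme threshold.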
5. The method of claim 1, wherein the determining of the first functional relation curve comprises performing the following iterative process at least once, until the reference probability value has been updated to its upper limit:
updating the reference probability value;
for any sample in the positive sample set and the negative sample set, if the prediction probability corresponding to the sample is greater than the current reference probability value and the sample belongs to the positive sample set, adding one to the number of correctly recognized samples among the positive samples;
if the prediction probability corresponding to the sample is greater than the current reference probability value and the sample belongs to the negative sample set, adding one to the number of falsely recognized samples among the negative samples;
determining the miss-recognition rate and the false-recognition rate of the phoneme under the current reference probability value according to the accumulated number of correctly recognized samples among the positive samples and the accumulated number of falsely recognized samples among the negative samples;
and determining the first functional relation curve according to the miss-recognition rate and the false-recognition rate of the phoneme obtained under each reference probability value.
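The claim-5 iteration can be sketched as follows (a minimal stand-in, assuming the reference probability is swept on a uniform grid up to its upper limit of 1.0): each sweep counts correctly recognized positives and falsely recognized negatives, tracing the first functional relation curve as (miss-recognition rate, false-recognition rate) points.

```python
def miss_false_curve(pos_probs, neg_probs, steps=100):
    curve = []
    for i in range(steps + 1):
        ref = i / steps                          # updated reference probability value
        n_correct = sum(1 for p in pos_probs if p > ref)   # positives recognized
        n_false = sum(1 for p in neg_probs if p > ref)     # negatives mis-recognized
        curve.append((1 - n_correct / len(pos_probs),
                      n_false / len(neg_probs)))
    return curve
```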
6. The method of claim 1, further comprising the following steps of determining the first target coordinate:
labeling each voice sample in a second voice sample set as a positive or negative sample, wherein a voice sample containing a second wake-up keyword is labeled as a positive sample, and a voice sample not containing the second wake-up keyword is labeled as a negative sample;
determining, through the wake-up system, the respective third scores of the voice samples in the second voice sample set on a second reference decoding path corresponding to the second wake-up keyword;
determining the intersection point between a target straight line and the first functional relation curve of each phoneme, wherein the target straight line passes through the coordinate origin and forms a target angle threshold with a preset coordinate axis of the coordinate system in which the first functional relation curve lies;
determining a fourth score for identifying whether a voice sample is wake-up voice according to the threshold value currently corresponding to each phoneme contained in the second reference decoding path, wherein the threshold value currently corresponding to each phoneme is the prediction probability corresponding to the intersection point on the corresponding first functional relation curve;
for any voice sample in the second voice sample set, determining the missed wake-up rate and the false wake-up rate of the target angle threshold according to the wake-up voice recognition result obtained by comparing the third score of the voice sample with its fourth score, and according to the positive/negative sample label of the voice sample;
and if the missed wake-up rate and the false wake-up rate of the target angle threshold meet a set condition, determining the intersection point between the target straight line and the first functional relation curve of each phoneme as the first target coordinate on the corresponding first functional relation curve.
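The line–curve intersection in claim 6 can be approximated discretely (an illustrative stand-in: the curve is represented as (miss_rate, false_rate, threshold) points, and the "intersection" is taken as the curve point closest to the target straight line through the origin):

```python
import math

def threshold_at_angle(curve, angle_deg):
    k = math.tan(math.radians(angle_deg))    # slope of the target straight line
    miss, false_rate, thr = min(curve, key=lambda pt: abs(pt[1] - k * pt[0]))
    return thr                               # prediction probability at the intersection
```

Sweeping `angle_deg` moves the operating point along each phoneme's curve, trading missed wake-ups against false wake-ups system-wide.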
7. The method of claim 6, wherein the determining of the fourth score comprises:
for any voice sample in the second voice sample set, determining a fourth score for identifying whether the voice sample is wake-up voice according to the threshold values currently corresponding to the phonemes contained in the second reference decoding path and the number of audio frames corresponding to each phoneme among the plurality of audio frames of the voice sample.
8. The method of claim 6, wherein the determining of the missed wake-up rate and the false wake-up rate of the target angle threshold comprises:
for any voice sample, if the third score of the voice sample is greater than its fourth score and the voice sample is labeled as a positive sample, adding one to the number of correct wake-ups;
if the third score of the voice sample is greater than its fourth score and the voice sample is labeled as a negative sample, adding one to the number of false wake-ups;
and determining the missed wake-up rate and the false wake-up rate of the target angle threshold according to the accumulated number of correct wake-ups and the accumulated number of false wake-ups over all the voice samples in the second voice sample set.
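The accounting in claim 8 can be sketched directly (illustrative only; rate definitions are the natural ones implied by the claim, namely missed rate over positives and false rate over negatives):

```python
def wake_rates(third_scores, fourth_scores, is_positive):
    correct = wrong = 0
    for s3, s4, pos in zip(third_scores, fourth_scores, is_positive):
        if s3 > s4:                  # sample recognized as wake-up voice
            if pos:
                correct += 1         # correct wake-up
            else:
                wrong += 1           # false wake-up
    n_pos = sum(is_positive)
    n_neg = len(is_positive) - n_pos
    return 1 - correct / n_pos, wrong / n_neg   # (missed rate, false rate)
```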
9. The method of claim 8, further comprising the following steps of determining the target angle threshold:
performing the following iterative process at least once, until the reference angle value has been updated to its upper limit:
updating the reference angle value;
determining the missed wake-up rate and the false wake-up rate of the current reference angle value according to the accumulated number of correct wake-ups and the accumulated number of false wake-ups over all the voice samples in the second voice sample set under the current reference angle value;
determining, according to the missed wake-up rate and the false wake-up rate obtained for each reference angle value, a second functional relation curve reflecting the missed wake-up rate and the false wake-up rate of each reference angle value;
and determining a second target coordinate meeting a set condition on the second functional relation curve, wherein the reference angle value corresponding to the second target coordinate is used as the target angle threshold.
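The outer loop of claim 9 can be sketched as an angle sweep (an assumption-laden sketch: the "set condition" on the second curve is again taken to be the equal-error point, and `eval_rates(angle)` stands for running the claim-8 accounting over the second voice sample set with thresholds taken at that angle, returning (missed_rate, false_rate)):

```python
def pick_target_angle(eval_rates, max_angle=90):
    best_a, best_gap = 0, float("inf")
    for a in range(max_angle + 1):           # update the reference angle value
        missed, false_rate = eval_rates(a)
        if abs(missed - false_rate) < best_gap:
            best_gap, best_a = abs(missed - false_rate), a
    return best_a
```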
10. A voice wakeup apparatus, comprising:
a receiving module, configured to receive user voice;
a determining module, configured to determine, through a wake-up system, a first score corresponding to the user voice on a first reference decoding path, wherein the first reference decoding path is established according to a first wake-up keyword customized by a user for a target object; and to determine a second score for identifying whether the user voice is wake-up voice according to the threshold value corresponding to each phoneme on the first reference decoding path;
a control module, configured to wake up the target object if the first score is greater than or equal to the second score;
a phoneme threshold determining module, configured to label the correspondence between audio frames and phonemes for each voice sample in a first voice sample set; for any phoneme, generate a positive sample set and a negative sample set corresponding to the phoneme, wherein the positive sample set comprises audio frames having the correspondence with the phoneme, and the negative sample set comprises audio frames not having the correspondence with the phoneme; traverse each sample in the positive sample set and the negative sample set, and output, through an acoustic model in the wake-up system, the prediction probability of the currently traversed sample under the phoneme; determine, according to the prediction probabilities respectively corresponding to the positive sample set and the negative sample set, a first functional relation curve reflecting the miss-recognition rate and the false-recognition rate of the phoneme; and determine the prediction probability corresponding to a first target coordinate on the first functional relation curve as the threshold value corresponding to the phoneme, wherein the first target coordinate makes the miss-recognition rate and the false-recognition rate meet a set condition.
11. An electronic device, comprising: a memory and a processor; wherein the memory stores executable code which, when executed by the processor, causes the processor to perform the voice wakeup method of any one of claims 1 to 9.
CN201910295356.5A 2019-04-12 2019-04-12 Voice wakeup method, device and equipment Active CN111862963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910295356.5A CN111862963B (en) 2019-04-12 2019-04-12 Voice wakeup method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910295356.5A CN111862963B (en) 2019-04-12 2019-04-12 Voice wakeup method, device and equipment

Publications (2)

Publication Number Publication Date
CN111862963A CN111862963A (en) 2020-10-30
CN111862963B true CN111862963B (en) 2024-05-10

Family

ID=72951362

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910295356.5A Active CN111862963B (en) 2019-04-12 2019-04-12 Voice wakeup method, device and equipment

Country Status (1)

Country Link
CN (1) CN111862963B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112530424A (en) * 2020-11-23 2021-03-19 北京小米移动软件有限公司 Voice processing method and device, electronic equipment and storage medium
CN113506574A (en) * 2021-09-09 2021-10-15 深圳市友杰智新科技有限公司 Method and device for recognizing user-defined command words and computer equipment

Citations (8)

Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109273007A (en) * 2018-10-11 2019-01-25 科大讯飞股份有限公司 Voice awakening method and device
CN109509465A (en) * 2017-09-15 2019-03-22 阿里巴巴集团控股有限公司 Processing method, component, equipment and the medium of voice signal

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US9373321B2 (en) * 2013-12-02 2016-06-21 Cypress Semiconductor Corporation Generation of wake-up words
US20170116994A1 (en) * 2015-10-26 2017-04-27 Le Holdings(Beijing)Co., Ltd. Voice-awaking method, electronic device and storage medium

Patent Citations (9)

Publication number Priority date Publication date Assignee Title
CN102999161A (en) * 2012-11-13 2013-03-27 安徽科大讯飞信息科技股份有限公司 Implementation method and application of voice awakening module
CN105654943A (en) * 2015-10-26 2016-06-08 乐视致新电子科技(天津)有限公司 Voice wakeup method, apparatus and system thereof
WO2017114201A1 (en) * 2015-12-31 2017-07-06 阿里巴巴集团控股有限公司 Method and device for executing setting operation
CN106940998A (en) * 2015-12-31 2017-07-11 阿里巴巴集团控股有限公司 A kind of execution method and device of setting operation
CN106098059A (en) * 2016-06-23 2016-11-09 上海交通大学 customizable voice awakening method and system
CN106611597A (en) * 2016-12-02 2017-05-03 百度在线网络技术(北京)有限公司 Voice wakeup method and voice wakeup device based on artificial intelligence
CN108281137A (en) * 2017-01-03 2018-07-13 中国科学院声学研究所 A kind of universal phonetic under whole tone element frame wakes up recognition methods and system
CN109509465A (en) * 2017-09-15 2019-03-22 阿里巴巴集团控股有限公司 Processing method, component, equipment and the medium of voice signal
CN109273007A (en) * 2018-10-11 2019-01-25 科大讯飞股份有限公司 Voice awakening method and device

Also Published As

Publication number Publication date
CN111862963A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
CN110473531B (en) Voice recognition method, device, electronic equipment, system and storage medium
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN103971685B (en) Method and system for recognizing voice commands
CN107767863B (en) Voice awakening method and system and intelligent terminal
US7421387B2 (en) Dynamic N-best algorithm to reduce recognition errors
CN110634469B (en) Speech signal processing method and device based on artificial intelligence and storage medium
Alon et al. Contextual speech recognition with difficult negative training examples
US20140244258A1 (en) Speech recognition method of sentence having multiple instructions
CN111797632B (en) Information processing method and device and electronic equipment
WO2021103712A1 (en) Neural network-based voice keyword detection method and device, and system
CN112017645B (en) Voice recognition method and device
CN112185348A (en) Multilingual voice recognition method and device and electronic equipment
EP2801092A1 (en) Methods, apparatuses and computer program products for implementing automatic speech recognition and sentiment detection on a device
CN109036471B (en) Voice endpoint detection method and device
CN114416943B (en) Training method and device for dialogue model, electronic equipment and storage medium
CN112581938B (en) Speech breakpoint detection method, device and equipment based on artificial intelligence
CN111862963B (en) Voice wakeup method, device and equipment
CN112818680A (en) Corpus processing method and device, electronic equipment and computer-readable storage medium
CN114999463B (en) Voice recognition method, device, equipment and medium
CN111554276A (en) Speech recognition method, device, equipment and computer readable storage medium
CN111883121A (en) Awakening method and device and electronic equipment
CN115457938A (en) Method, device, storage medium and electronic device for identifying awakening words
US20050187767A1 (en) Dynamic N-best algorithm to reduce speech recognition errors
CN111508497B (en) Speech recognition method, device, electronic equipment and storage medium
WO2020209957A1 (en) Automated speech recognition confidence classifier

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant