CN116705015A - Equipment wake-up method, device and computer readable storage medium - Google Patents


Info

Publication number: CN116705015A
Application number: CN202210174193.7A
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: phoneme, phonemes, sequence, frames, frame
Inventors: 赵惟肖, 史润宇
Applicant and current assignee: Beijing Xiaomi Mobile Software Co Ltd
Legal status: Pending (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G: Physics
    • G10: Musical instruments; acoustics
    • G10L: Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; selection of recognition unit
    • G10L 2015/025: Phonemes, fenemes or fenones being the recognition units
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • Y02D 30/70: Reducing energy consumption in wireless communication networks


Abstract

The present disclosure relates to a device wake-up method, apparatus and computer-readable storage medium in the field of device control. The method comprises: obtaining a plurality of phoneme frames from received voice information, the duration of each phoneme frame being a preset duration; inputting the plurality of phoneme frames into a preset acoustic model to obtain a first phoneme sequence, the first phoneme sequence comprising a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme; and, when invalid phonemes and/or missing phonemes in the first phoneme sequence are determined to meet a preset condition, determining a wake-up result of the voice information according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in it. The method can recognize the phonemes in the voice information without a complex decoder, thereby simplifying phoneme-level voice wake-up.

Description

Equipment wake-up method, device and computer readable storage medium
Technical Field
The present disclosure relates to the field of device control, and in particular, to a device wake-up method, apparatus, and computer readable storage medium.
Background
With the development of intelligent voice technology, more and more intelligent voice devices have appeared. Voice wake-up is the entry point for interaction between an intelligent voice device and its user: an intelligent voice device (such as a mobile phone, home appliance or vehicle-mounted system) detects a specific voice instruction while dormant or with the screen locked, causing the dormant device to enter a state of awaiting instructions, which is the first step of voice interaction.
Voice wake-up technology can be divided by recognition object into keyword recognition and voiceprint recognition. Keyword recognition refers to recognizing a specific voice instruction, namely the wake-up word, in a continuous voice stream. Voiceprint recognition, also known as speaker recognition, is a technique that extracts a speaker's vocal characteristics and verifies the speaker's identity. The keyword recognition used in voice wake-up can be further subdivided into end-to-end voice wake-up, word-level voice wake-up and phoneme-level voice wake-up.
In the related art, phoneme-level voice wake-up usually requires that, after the phoneme corresponding to each frame of speech is determined, a decoder be attached to make a further determination; such a decoder is complex to implement and places certain demands on hardware.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a device wake-up method, apparatus, and computer-readable storage medium.
According to a first aspect of an embodiment of the present disclosure, there is provided a device wake-up method, including:
acquiring a plurality of phoneme frames according to the received voice information, wherein the duration of each phoneme frame is a preset duration;
inputting the plurality of phoneme frames into a preset acoustic model to obtain a first phoneme sequence, wherein the first phoneme sequence comprises a plurality of phonemes corresponding to the plurality of phoneme frames and a judging probability of each phoneme;
and under the condition that invalid phonemes and/or missing phonemes in the first phoneme sequence are determined to meet a preset condition, determining a wake-up result of the voice information according to the first phoneme sequence and the judging probabilities of a plurality of phonemes in the first phoneme sequence.
Optionally, the inputting of the plurality of phoneme frames into a preset acoustic model to obtain a first phoneme sequence, where the first phoneme sequence includes a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme, includes:
inputting the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model, wherein the plurality of phoneme sequences are in one-to-one correspondence with the plurality of phoneme frames, and each phoneme sequence comprises at least one phoneme and a decision probability of the at least one phoneme;
For each of the plurality of phoneme frames, determining a phoneme with highest judging probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame;
and taking a plurality of phonemes corresponding to the obtained plurality of phoneme frames as the first phoneme sequence.
Optionally, for each of the plurality of phoneme frames, determining a phoneme with the highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame includes:
normalizing the decision probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain a normalized decision probability of each phoneme;
and determining the phoneme with the highest normalized decision probability in the phoneme sequence as the phoneme corresponding to the phoneme frame.
Optionally, the invalid phonemes include isolated phonemes and extra phonemes, and the method further comprises:
acquiring a second phoneme sequence corresponding to preset standard wake-up voice information;
determining, based on the second phoneme sequence, whether isolated phonemes, extra phonemes and/or missing phonemes are present in the first phoneme sequence; wherein a phoneme in the first phoneme sequence that is unrelated to both its preceding adjacent phoneme and its following adjacent phoneme is an isolated phoneme; a phoneme that is related to its preceding and/or following adjacent phoneme but does not exist in the second phoneme sequence is an extra phoneme; and a phoneme that exists in the second phoneme sequence but not in the first phoneme sequence is a missing phoneme;
in the case that isolated phonemes, extra phonemes and/or missing phonemes are determined to exist in the first phoneme sequence, determining whether the isolated phonemes, extra phonemes and/or missing phonemes meet the preset condition.
Optionally, the preset condition includes:
the number of isolated phonemes is less than a first number threshold, the number of extra phonemes is less than a second number threshold, and the number of missing phonemes is less than a third number threshold.
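For illustration, the preset condition above amounts to a simple count check. A minimal sketch (the function name and threshold values are hypothetical; the disclosure only states that the thresholds are preset):

```python
def meets_preset_condition(num_isolated, num_extra, num_missing,
                           first_threshold=2, second_threshold=2,
                           third_threshold=2):
    """Check the preset condition: each defect count must stay below
    its respective number threshold. Threshold values are illustrative."""
    return (num_isolated < first_threshold
            and num_extra < second_threshold
            and num_missing < third_threshold)
```

With these illustrative thresholds, a sequence with one extra phoneme and one missing phoneme would still pass, while three isolated phonemes would fail.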
Optionally, in the case that it is determined that the invalid phoneme in the first phoneme sequence and/or the missing phoneme meets a preset condition, determining a wake-up result of the speech information according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence includes:
removing, from the plurality of phonemes in the first phoneme sequence, phonemes whose decision probabilities are smaller than a probability threshold, to obtain a third phoneme sequence;
acquiring a sum of the decision probabilities of the plurality of phonemes in the third phoneme sequence;
determining that wake-up by the voice information fails when the sum of the decision probabilities is smaller than a total probability threshold;
and determining that wake-up by the voice information succeeds when the sum of the decision probabilities is greater than or equal to the total probability threshold.
Optionally, the inputting the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model includes:
extracting a feature vector of each of the plurality of phoneme frames;
and inputting the feature vectors of the plurality of phoneme frames into the acoustic model frame by frame to obtain the plurality of phoneme sequences output by the acoustic model.
According to a second aspect of embodiments of the present disclosure, there is provided a device wake-up apparatus, comprising:
the acquisition module is configured to acquire a plurality of phoneme frames according to the received voice information, wherein the duration of each phoneme frame is a preset duration;
an identification module configured to input the plurality of phoneme frames into an acoustic model to obtain a first phoneme sequence, wherein the first phoneme sequence comprises a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme;
and the processing module is configured to determine a wake-up result of the voice information according to the first phoneme sequence and the judgment probabilities of a plurality of phonemes in the first phoneme sequence under the condition that the invalid phonemes and/or the missing phonemes in the first phoneme sequence are determined to meet the preset condition.
Optionally, the identification module includes: identifying a sub-module and determining a sub-module;
the recognition sub-module is configured to input the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model, wherein the plurality of phoneme sequences are in one-to-one correspondence with the plurality of phoneme frames, and each phoneme sequence comprises at least one phoneme and a judging probability of the at least one phoneme;
the determining submodule is configured to determine, for each of the plurality of phoneme frames, a phoneme with the highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame; and taking a plurality of phonemes corresponding to the obtained plurality of phoneme frames as the first phoneme sequence.
Optionally, the determining submodule is configured to:
normalizing the decision probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain a normalized decision probability of each phoneme;
and determining the phoneme with the highest normalized decision probability in the phoneme sequence as the phoneme corresponding to the phoneme frame.
Optionally, the invalid phonemes include isolated phonemes and extra phonemes, and the apparatus further includes: a preprocessing module configured to:
acquire a second phoneme sequence corresponding to preset standard wake-up voice information;
determine, based on the second phoneme sequence, whether isolated phonemes, extra phonemes and/or missing phonemes are present in the first phoneme sequence; wherein a phoneme in the first phoneme sequence that is unrelated to both its preceding adjacent phoneme and its following adjacent phoneme is an isolated phoneme; a phoneme that is related to its preceding and/or following adjacent phoneme but does not exist in the second phoneme sequence is an extra phoneme; and a phoneme that exists in the second phoneme sequence but not in the first phoneme sequence is a missing phoneme;
in the case that isolated phonemes, extra phonemes and/or missing phonemes are determined to exist in the first phoneme sequence, determine whether the isolated phonemes, extra phonemes and/or missing phonemes meet the preset condition.
Optionally, the preset condition includes:
The number of isolated phonemes is less than a first number threshold, the number of extra phonemes is less than a second number threshold, and the number of missing phonemes is less than a third number threshold.
Optionally, the processing module is configured to:
remove, from the plurality of phonemes in the first phoneme sequence, phonemes whose decision probabilities are smaller than a probability threshold, to obtain a third phoneme sequence;
acquire a sum of the decision probabilities of the plurality of phonemes in the third phoneme sequence;
determine that wake-up by the voice information fails when the sum of the decision probabilities is smaller than a total probability threshold;
and determine that wake-up by the voice information succeeds when the sum of the decision probabilities is greater than or equal to the total probability threshold.
Optionally, the identifying sub-module is configured to:
extracting a feature vector of each of the plurality of phoneme frames;
and input the feature vectors of the plurality of phoneme frames into the acoustic model frame by frame to obtain the plurality of phoneme sequences output by the acoustic model.
According to a third aspect of embodiments of the present disclosure, there is provided a device wake-up apparatus, comprising:
A processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions to implement the steps of the method of any one of the first aspects.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions, characterized in that the program instructions when executed by a processor implement the steps of the method of any of the first aspects.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects:
in the above technical solution, a plurality of phoneme frames are obtained from the received voice information, with each phoneme frame having a preset duration; the phoneme frames are input into a preset acoustic model to obtain a first phoneme sequence comprising a plurality of phonemes corresponding to the phoneme frames and the decision probability of each phoneme; then, when invalid phonemes and/or missing phonemes in the first phoneme sequence are determined to meet a preset condition, the wake-up result of the voice information is determined according to the first phoneme sequence and the decision probabilities of its phonemes. In this way, the phonemes in the voice information can be recognized without a complex decoder, the hardware demands of a complex decoder are avoided, and phoneme-level voice wake-up is thereby simplified.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flow chart illustrating a device wake-up method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating a phoneme recognition method according to an exemplary embodiment.
Fig. 3 is a flowchart illustrating another phoneme recognition method according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a device wake-up apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a device wake-up apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
In the related art, the basic flow of phoneme-level voice wake-up generally includes two stages: a training stage and a recognition stage. In the training stage, an acoustic model and a language model are trained from a speech database and a text database, respectively. In the recognition stage, the received speech is input into the acoustic model and the language model to obtain a speech recognition result, and the recognition result is decoded by a decoder to obtain the text corresponding to the input speech.
Among these decoders, WFST (Weighted Finite-State Transducer) decoders are commonly used. Before decoding with a WFST, a static decoding graph HCLG must be built for the decoder, composed of an acoustic model (H, an HMM, i.e. Hidden Markov Model), context dependency (C), a lexicon (L) and a grammar or language model (G). The lexicon, i.e. the pronunciation dictionary, is a mapping between phonemes and words that connects the acoustic model and the language model: given the phoneme output of the acoustic model, the language model combines with the pronunciation dictionary to produce a text sequence.
The main idea of WFST-based speech recognition decoding is to express the three model layers, namely the acoustic model, the language model and the pronunciation dictionary, each as a finite-state transducer, and to compose them into a single weighted finite-state transducer forming the decoding/search network.
However, with the above WFST decoding method, construction of the decoding graph HCLG is complex and slow, making it better suited to speech recognition scenarios backed by a text database than to voice wake-up scenarios, where the vocabulary is fixed, the phonemes are few, and a quick response is required. Moreover, the HCLG decoding graph occupies a large amount of space, consumes considerable memory during decoding, places high demands on hardware, and is difficult to deploy directly on mobile terminal hardware. To address these problems, the present disclosure provides a device wake-up method applicable to any electronic device that supports voice wake-up, for example a mobile terminal such as a smart speaker, smart phone, tablet computer, smart television, smart wearable device, PDA (Personal Digital Assistant) or portable computer, or a fixed terminal such as a desktop computer. The device wake-up method is described below.
Fig. 1 is a flowchart illustrating a device wake-up method according to an exemplary embodiment, and as shown in fig. 1, the device wake-up method may be used in the above electronic device, and includes the following steps.
In step S11, a plurality of phoneme frames are acquired according to the received voice information, and the duration of each phoneme frame is a preset duration.
A phoneme (phone) is the smallest phonetic unit divided according to the natural attributes of speech; analyzed by the articulatory actions within a syllable, one action forms one phoneme, and phonemes fall into two major classes, vowels and consonants. The electronic device may collect the user's voice information through a microphone. The voice information may be a continuous voice stream of a certain duration, which can be divided into frames of a preset duration, each frame corresponding to one phoneme; a frame containing a phoneme may therefore be called a phoneme frame. For example, with a preset duration of 25 milliseconds, the voice information is divided into a number of phoneme frames according to its total duration.
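The framing step can be sketched as follows. This is a minimal illustration assuming 16 kHz mono PCM input; the sample rate and the function name are assumptions, and only the 25 ms frame duration comes from the example above:

```python
def split_into_frames(samples, sample_rate=16000, frame_ms=25):
    """Split a mono PCM sample sequence into fixed-duration phoneme frames.

    Hypothetical helper: the disclosure only specifies a preset frame
    duration (e.g. 25 ms); the sample rate here is an assumption.
    """
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    # Drop any trailing partial frame shorter than the preset duration.
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
```

One second of 16 kHz audio yields 40 frames of 400 samples each under these assumptions.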
In step S12, the plurality of phoneme frames are input into a preset acoustic model to obtain a first phoneme sequence, where the first phoneme sequence includes a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme.
The acoustic model may be a trained acoustic model, for example the HMM model described above. After the plurality of phoneme frames are input into the acoustic model, the phoneme corresponding to each phoneme frame and the decision probability of each phoneme can be obtained. The decision probability characterizes the likelihood that the phoneme frame is indeed the phoneme determined to correspond to it. For example, if the phoneme output by the acoustic model for phoneme frame 1 is phoneme A with a decision probability of 0.78, then phoneme frame 1 is 78% likely to be phoneme A.
In step S13, in the case where it is determined that the invalid phonemes and/or missing phonemes in the first phoneme sequence meet the preset condition, a wake-up result of the speech information is determined according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence.
It will be appreciated that the presence of invalid and/or missing phonemes in a phoneme sequence affects the speech recognition result, and the more such phonemes there are, the greater the effect. Therefore, before further decoding the first phoneme sequence, it is necessary to determine whether invalid and/or missing phonemes are present, and how many.
For example, a rule may be set in advance as the preset condition, specifying whether invalid and/or missing phonemes are permitted at all and, if so, how many. Step S13 is then performed when the invalid phonemes and/or missing phonemes meet the preset condition. In step S13, the first phoneme sequence may be decoded as follows: compare the decision probability of each phoneme in the first phoneme sequence with a preset probability threshold, keep the phonemes that reach the threshold and remove those below it, then determine whether the sum of the decision probabilities of the remaining phonemes meets a certain condition; if so, the voice information is determined to have woken the device successfully, and the step of waking up the electronic device is executed.
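The decision procedure just described can be sketched as follows. The threshold values and function name are hypothetical placeholders; the disclosure only states that the thresholds are preset:

```python
def wake_decision(decision_probs, phone_threshold=0.5, total_threshold=2.0):
    """Sketch of the step S13 decision (thresholds are illustrative).

    decision_probs: the decision probability of each phoneme in the
    first phoneme sequence, in order.
    """
    # Keep only phonemes whose decision probability reaches the
    # per-phoneme threshold (this forms the "third phoneme sequence").
    kept = [p for p in decision_probs if p >= phone_threshold]
    # Wake-up succeeds when the summed decision probability of the
    # remaining phonemes reaches the total probability threshold.
    return sum(kept) >= total_threshold
```

For example, decision probabilities [0.85, 0.85, 0.78, 0.2] would succeed under these illustrative thresholds (0.2 is filtered out and 0.85 + 0.85 + 0.78 = 2.48 ≥ 2.0), while [0.6, 0.6] would fail.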
Fig. 2 is a flowchart of a phoneme recognition method according to an exemplary embodiment, as shown in fig. 2, the inputting the plurality of phoneme frames into a preset acoustic model in step S12 to obtain a first phoneme sequence, where the first phoneme sequence includes a plurality of phonemes corresponding to the plurality of phoneme frames, and a decision probability of each phoneme may include the following steps:
Step S121, inputting the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model.
In one implementation, feature extraction may be performed on a plurality of phoneme frames of the speech information, and then the extracted feature of each phoneme frame may be streamed into the acoustic model, which may illustratively include the steps of:
first, a feature vector of each of a plurality of phoneme frames of the speech information is extracted, and a method of extracting the feature vector is not limited in the present disclosure.
Second, the feature vectors of the plurality of phoneme frames are input into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model.
The plurality of phoneme sequences output by the acoustic model are in one-to-one correspondence with the plurality of input phoneme frames, and each phoneme sequence includes at least one phoneme and the decision probability of the at least one phoneme.
Step S122, for each of the plurality of phoneme frames, determining a phoneme with the highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame.
Step S123, taking the obtained plurality of phonemes corresponding to the plurality of phoneme frames as the first phoneme sequence.
For example, assume that the wake-up word of electronic device A is "xiao ming tong xue", whose corresponding phoneme sequence is "x", "iao3", "m", "ing2", "t", "ong2", "x", "ve2", where the numeral in a phoneme denotes the Pinyin tone. To wake up electronic device A, the user needs to speak the wake-up word corresponding to these phonemes. The user speaks a continuous piece of voice information, which, as described above, is divided into a plurality of phoneme frames; as the phoneme frames are input into the acoustic model frame by frame, the acoustic model outputs a phoneme sequence for each phoneme frame, each phoneme sequence containing at least one phoneme and the decision probability of that at least one phoneme. This wake-up word is only an example; any other wake-up word may be set according to actual needs, or set by the user.
For example, suppose the voice information contains three phoneme frames: phoneme frame 1, phoneme frame 2 and phoneme frame 3. After inputting phoneme frames 1-3 into the acoustic model frame by frame, assume the phoneme sequence corresponding to phoneme frame 1 is "x" and "t", with decision probabilities 0.85 and 0.15; the phoneme sequence corresponding to phoneme frame 2 is "x", "iao3", "t", with decision probabilities 0.1, 0.85, 0.05; and the phoneme sequence corresponding to phoneme frame 3 is "m", "in2", "ing2", with decision probabilities 0.07, 0.15, 0.78. Since the phoneme with the highest decision probability is selected from the phoneme sequence corresponding to each phoneme frame, the phoneme corresponding to phoneme frame 1 is "x", that of phoneme frame 2 is "iao3", and that of phoneme frame 3 is "ing2". The first phoneme sequence thus obtained is: "x", "iao3", "ing2".
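Steps S121-S123 on the example above can be sketched as follows (a minimal illustration; the list-of-candidates data structure and function name are assumptions):

```python
def pick_phonemes(frame_outputs):
    """Steps S122-S123: for each frame's phoneme sequence of
    (phoneme, decision_probability) candidates, keep the phoneme with
    the highest decision probability; the kept phonemes, in frame
    order, form the first phoneme sequence."""
    return [max(candidates, key=lambda c: c[1])[0]
            for candidates in frame_outputs]

# The three phoneme frames from the example above.
frames = [
    [("x", 0.85), ("t", 0.15)],
    [("x", 0.10), ("iao3", 0.85), ("t", 0.05)],
    [("m", 0.07), ("in2", 0.15), ("ing2", 0.78)],
]
first_sequence = pick_phonemes(frames)  # ["x", "iao3", "ing2"]
```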
Further, fig. 3 is a flowchart of another phoneme recognition method according to an exemplary embodiment, as shown in fig. 3, for each of the plurality of phoneme frames, determining, as a phoneme corresponding to the phoneme frame, a phoneme with the highest decision probability in a phoneme sequence corresponding to the phoneme frame in step S122 may include the following steps:
Step S1221, normalizing the decision probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain a normalized decision probability of each phoneme.
In step S1222, the phoneme with the highest normalized decision probability in the phoneme sequence is determined as the phoneme corresponding to the phoneme frame.
Since unrecognized phonemes may occur and silent segments may be encountered in the process of recognizing phonemes through the above-described acoustic model, an unrecognized phoneme may be marked as "unk" and a silent segment may be marked as "sil". Because "unk" and "sil" may be present, the following may occur: the phoneme sequence corresponding to a certain phoneme frame contains "unk" and/or "sil", and its decision probability is larger than that of the other phonemes in the phoneme sequence. Therefore, in order to avoid interference from "unk" and "sil", the decision probability of each phoneme in the phoneme sequence corresponding to the phoneme frame may first be normalized, where "unk" and/or "sil" are ignored in the normalization process.
For example, assume that the phoneme sequence corresponding to a certain phoneme frame is "m", "in2", "ing2", "unk", with decision probabilities 0.07, 0.15, 0.38, and 0.4, respectively. Before normalization, the decision probability of "unk" is 0.4, the maximum value in the phoneme sequence; if normalization were not performed, the phoneme corresponding to the phoneme frame would be determined to be "unk".
The normalization method is as follows: the decision probability of phoneme i is divided by the sum of the decision probabilities of all phonemes in the phoneme sequence other than "unk" and "sil", and the resulting quotient is used as the normalized decision probability of phoneme i, where phoneme i is any phoneme in the phoneme sequence other than "unk" and "sil".
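Written out, with $S$ denoting the set of phonemes in the frame's phoneme sequence and $p_i$ the decision probability of phoneme $i$, the normalization just described is (our own formalization of the rule):

```latex
\hat{p}_i \;=\; \frac{p_i}{\sum_{j \in S \setminus \{\mathrm{unk},\,\mathrm{sil}\}} p_j},
\qquad i \in S \setminus \{\mathrm{unk},\,\mathrm{sil}\}.
```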
Based on the above method, it can be determined that:
Normalized decision probability of "m" = 0.07 ÷ (0.07 + 0.15 + 0.38) ≈ 0.12
Normalized decision probability of "in2" = 0.15 ÷ (0.07 + 0.15 + 0.38) = 0.25
Normalized decision probability of "ing2" = 0.38 ÷ (0.07 + 0.15 + 0.38) ≈ 0.63
Therefore, the normalized decision probability of "ing2" is the maximum, and the phoneme corresponding to the phoneme frame can be determined to be "ing2".
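A minimal sketch of this normalization, reproducing the worked example above (the function name and the dict representation are our own, not the patent's):

```python
def normalize(candidates, ignore=("unk", "sil")):
    """Normalize decision probabilities, ignoring the 'unk' and 'sil' markers."""
    total = sum(p for ph, p in candidates.items() if ph not in ignore)
    return {ph: p / total for ph, p in candidates.items() if ph not in ignore}

frame = {"m": 0.07, "in2": 0.15, "ing2": 0.38, "unk": 0.40}
norm = normalize(frame)
best = max(norm, key=norm.get)  # 'ing2', even though 'unk' had the raw maximum
```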
Optionally, the invalid phonemes described in step S13 may include isolated phonemes and extra phonemes. A phoneme is an isolated phoneme when it is unrelated to both its previous adjacent phoneme and its next adjacent phoneme; two or more consecutive phonemes that are each unrelated to the previous adjacent phoneme and the next adjacent phoneme are extra phonemes; and a phoneme that should appear according to the phoneme sequence of the standard wake-up voice information but is absent from the first phoneme sequence is a missing phoneme. For any phoneme, the previous adjacent phoneme refers to the phoneme immediately preceding it in the phoneme sequence, and the next adjacent phoneme refers to the phoneme immediately following it in the phoneme sequence.
Before step S13, the method may further include determining whether invalid phonemes and/or missing phonemes exist in the first phoneme sequence, which may include the following steps:
firstly, a second phoneme sequence corresponding to preset standard wake-up voice information is obtained.
Secondly, the isolated phonemes, the extra phonemes and/or the missing phonemes in the first phoneme sequence are determined according to the second phoneme sequence; that is, it is determined whether one or more of isolated phonemes, extra phonemes, and missing phonemes exist in the first phoneme sequence.
Finally, in the case that isolated phonemes, extra phonemes and/or missing phonemes exist in the first phoneme sequence, it is determined whether the isolated phonemes, extra phonemes and/or missing phonemes meet the preset condition.
The preset conditions in the step S13 include: the number of orphaned phones is less than a first number threshold, the number of extra phones is less than a second number threshold, and the number of missing phones is less than a third number threshold. I.e. it can be understood that for any phoneme sequence the number of isolated, extra and missing phonemes in the phoneme sequence cannot exceed a certain number.
For example, taking the above wake-up word "xiao ming tong xue" as an example, the second phoneme sequence may be "x", "iao3", "m", "ing2", "t", "ong2", "x", "ve2", which is the phoneme sequence corresponding to the standard wake-up voice information; with reference to this sequence, it can be determined whether an isolated phoneme, an extra phoneme, or a missing phoneme exists in the first phoneme sequence. For example, if the first phoneme sequence is "x", "iao3", "t", "m", "ing2", "t", "ong2", "x", "ve2", it is known from the second phoneme sequence that the previous adjacent phoneme of "t" should be "ing2" and the next adjacent phoneme should be "ong2"; that is, the phoneme related to "t" is "ong2", with which it constitutes the pronunciation of the character "tong". The third phoneme "t" in the first phoneme sequence is unrelated to its neighbors "iao3" and "m", so it can be determined that this phoneme is an isolated phoneme.
If the first phoneme sequence is "x", "iao3", "t", "t", "m", "ing2", "t", "ong2", "x", "ve2", then by the same principle as above, the third and fourth phonemes are two consecutive phonemes "t" that are unrelated to the previous adjacent phoneme "iao3" and the next adjacent phoneme "m". Since there are two of them in a row, they no longer count as isolated phonemes; instead, it can be determined that these two consecutive "t" phonemes in the first phoneme sequence are extra phonemes.
If the first phoneme sequence is "x", "iao3", "m", "ing2", "t", "ong2", "ve2", it is known from the second phoneme sequence that there should be an "x" between "ong2" and "ve2", so it can be determined that the phoneme "x" is missing from the first phoneme sequence.
After the isolated phonemes, extra phonemes and/or missing phonemes present in the first phoneme sequence are determined, they may be labeled accordingly.
The first number threshold, the second number threshold, and the third number threshold may be the same or different, for example, may be set to 2, may be set according to actual requirements, and is not limited in this disclosure.
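One way to sketch this check is to align the first phoneme sequence against the second (standard) sequence and count the mismatches. The use of `difflib` for the alignment, the handling of "replace" opcodes, and the default thresholds of 2 are all our own assumptions for illustration, not the patent's actual decoder:

```python
from difflib import SequenceMatcher

def check_sequence(first, reference,
                   max_isolated=2, max_extra=2, max_missing=2):
    """Classify isolated/extra/missing phonemes by aligning the recognized
    sequence against the reference, then apply the preset count thresholds."""
    isolated, extra, missing = 0, 0, 0
    sm = SequenceMatcher(a=reference, b=first, autojunk=False)
    for op, a0, a1, b0, b1 in sm.get_opcodes():
        if op in ("insert", "replace"):
            run = b1 - b0  # phonemes present in `first` but not expected here
            if run == 1:
                isolated += 1   # a single unrelated phoneme
            elif run > 1:
                extra += run    # a run of consecutive unrelated phonemes
        if op in ("delete", "replace"):
            missing += a1 - a0  # expected phonemes absent from `first`
    ok = (isolated < max_isolated and extra < max_extra
          and missing < max_missing)
    return isolated, extra, missing, ok

reference = ["x", "iao3", "m", "ing2", "t", "ong2", "x", "ve2"]
first = ["x", "iao3", "t", "m", "ing2", "t", "ong2", "x", "ve2"]
print(check_sequence(first, reference))  # (1, 0, 0, True): one isolated "t"
```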
Optionally, in the case that it is determined in step S13 that the invalid phonemes and/or missing phonemes in the first phoneme sequence meet the preset condition, determining a wake-up result of the speech information according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence, including the following steps:
removing phonemes with judging probabilities smaller than a probability threshold value from a plurality of phonemes in the first phoneme sequence to obtain a third phoneme sequence;
obtaining a sum of decision probabilities of a plurality of phonemes in the third phoneme sequence;
under the condition that the sum of the judging probabilities is smaller than the total probability threshold, judging that the voice information is awakened to fail;
and determining that the voice information wakes up successfully in the case that the sum of the judging probabilities is greater than or equal to a total probability threshold.
For example, assume that the voice information includes 10 phoneme frames, labeled phoneme frames 1-10, and that the first phoneme sequence obtained from phoneme frames 1-10 through the above steps includes 10 phonemes: "x", "iao3", "m", "m", "ing2", "t", "ong2", "x", "ve2", "t", with decision probabilities 0.78, 0.85, 0.96, 0.52, 0.88, 0.72, 0.65, 0.91, 0.89, and 0.45, respectively. With the probability threshold set to 0.6, it can be determined that the decision probabilities of the fourth phoneme "m" (0.52) and the tenth phoneme "t" (0.45) are smaller than the probability threshold, so these two phonemes are removed and the remaining phonemes are retained. The resulting third phoneme sequence is: "x", "iao3", "m", "ing2", "t", "ong2", "x", "ve2", 8 phonemes in total.
For example, the total probability threshold may be set to 5. The sum of the decision probabilities of the phonemes in the third phoneme sequence = 0.78 + 0.85 + 0.96 + 0.88 + 0.72 + 0.65 + 0.91 + 0.89 = 6.64 > 5, so it can be determined that the voice information is woken up successfully, and the operation of waking up the electronic device is performed.
The probability threshold and the total probability threshold are exemplary, and the values thereof can be set according to actual needs.
In addition, it should be noted that whether wake-up succeeds may be determined directly from the sum of the decision probabilities of the phonemes, without examining the decision probability of any single phoneme. Alternatively, wake-up may be determined solely from the per-phoneme decision probabilities, without comparing the sum of the decision probabilities against the total probability threshold: for example, a pass-rate threshold for the phoneme sequence may be preset, e.g. 70%, and wake-up is determined to be successful when the pass rate exceeds this threshold. In the above example, 8 of the 10 phonemes in the first phoneme sequence have decision probabilities greater than the probability threshold, so the pass rate is 80%, and it can be determined that the voice information is woken up successfully.
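Both decision rules, with the thresholds used in the example (per-phoneme threshold 0.6, total probability threshold 5, pass-rate threshold 70%), can be sketched as follows; the function and its defaults are illustrative assumptions only:

```python
def wake_decision(probs, prob_threshold=0.6, total_threshold=5.0,
                  pass_rate_threshold=0.7):
    """Two alternative wake-up checks from the worked example:
    (a) the sum of decision probabilities of phonemes that pass the
        per-phoneme threshold, compared against a total-probability threshold;
    (b) the fraction of phonemes passing the per-phoneme threshold,
        compared against a pass-rate threshold."""
    kept = [p for p in probs if p >= prob_threshold]
    total = sum(kept)
    by_sum = total >= total_threshold
    by_rate = len(kept) / len(probs) >= pass_rate_threshold
    return total, by_sum, by_rate

# Decision probabilities of the 10-phoneme first sequence from the example.
probs = [0.78, 0.85, 0.96, 0.52, 0.88, 0.72, 0.65, 0.91, 0.89, 0.45]
total, by_sum, by_rate = wake_decision(probs)
# total is 6.64; both checks succeed (6.64 > 5, and 8/10 = 80% >= 70%)
```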
With this method, a complex decoding graph does not need to be constructed, so the decoder constructed based on the rules described in step S13 is simple to implement and has low requirements on storage space and computing resources, thereby achieving accurate phoneme recognition while reducing the hardware requirements.
In the above technical solution, a plurality of phoneme frames are obtained from the received voice information, the duration of each phoneme frame being a preset duration; the plurality of phoneme frames are input into a preset acoustic model to obtain a first phoneme sequence, the first phoneme sequence including a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme; and, in the case that the invalid phonemes and/or missing phonemes in the first phoneme sequence are determined to meet the preset condition, a wake-up result of the voice information is determined according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence. In this implementation, the phonemes in the voice information can be recognized without providing a complex decoder, avoiding the hardware requirements of a complex decoder and thereby simplifying phoneme-level voice wake-up.
Fig. 4 is a block diagram of a device wake-up apparatus, according to an exemplary embodiment. Referring to fig. 4, the apparatus 400 includes an acquisition module 410, a recognition module 420, and a processing module 430.
The acquiring module 410 is configured to acquire a plurality of phoneme frames according to the received voice information, where a duration of each phoneme frame is a preset duration;
The recognition module 420 is configured to input the plurality of phoneme frames into an acoustic model to obtain a first phoneme sequence, wherein the first phoneme sequence comprises a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each of the phonemes;
the processing module 430 is configured to determine a wake-up result of the speech information according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence, if it is determined that the invalid phonemes and/or the missing phonemes in the first phoneme sequence meet the preset condition.
Optionally, the recognition module 420 includes a recognition sub-module and a determination sub-module;
the recognition sub-module is configured to input the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model, wherein the plurality of phoneme sequences are in one-to-one correspondence with the plurality of phoneme frames, and each phoneme sequence comprises at least one phoneme and a decision probability of the at least one phoneme;
the determining submodule is configured to determine, for each of the plurality of phoneme frames, a phoneme with the highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame; and taking the obtained phonemes corresponding to the phoneme frames as the first phoneme sequence.
Optionally, the determining submodule is configured to:
normalizing the judging probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain the normalized judging probability of each phoneme;
and determining the phoneme with the highest judgment probability after normalization processing in the phoneme sequence as a phoneme corresponding to the phoneme frame.
Optionally, the inactive phonemes include an orphan phoneme and an extra phoneme, and the device wake-up apparatus further includes: a preprocessing module configured to:
acquiring a second phoneme sequence corresponding to preset standard wake-up voice information;
determining the isolated phonemes, the extra phonemes and/or the missing phonemes in the first phoneme sequence based on the second phoneme sequence; wherein a phoneme that is unrelated to both its previous adjacent phoneme and its next adjacent phoneme is an isolated phoneme, two or more consecutive phonemes that are each unrelated to the previous adjacent phoneme and the next adjacent phoneme are extra phonemes, and a phoneme that exists in the second phoneme sequence but is absent from the first phoneme sequence is a missing phoneme;
in case it is determined that the isolated phone, the extra phone and/or the missing phone are present in the first phone sequence, it is determined whether the isolated phone, the extra phone and/or the missing phone fulfill the preset condition.
Optionally, the preset condition includes:
the number of orphaned phones is less than a first number threshold, the number of additional phones is less than a second number threshold, and the number of missing phones is less than a third number threshold.
Optionally, the processing module is configured to:
removing phonemes with the judging probability smaller than a probability threshold value from a plurality of phonemes in the first phoneme sequence to obtain a third phoneme sequence;
obtaining a sum of decision probabilities of a plurality of phonemes in the third phoneme sequence;
under the condition that the sum of the judging probabilities is smaller than a total probability threshold value, judging that the voice information is awakened to fail;
and determining that the voice information is successfully awakened under the condition that the sum of the judging probabilities is greater than or equal to the total probability threshold value.
Optionally, the recognition sub-module is configured to:
extracting a feature vector of each of the plurality of phoneme frames;
and inputting the feature vectors of the plurality of phoneme frames into the acoustic model frame by frame to obtain the plurality of phoneme sequences output by the acoustic model.
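As a toy illustration of frame splitting and per-frame feature extraction (the 25 ms frame length and the log-power-spectrum feature are stand-in assumptions; real systems typically use FBank or MFCC features):

```python
import numpy as np

def split_frames(samples, sample_rate, frame_ms=25):
    """Split raw audio into fixed-duration frames (duration is an assumption)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n)]

def feature_vector(frame):
    """Toy per-frame feature: log power spectrum, a stand-in for FBank/MFCC."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    return np.log(spectrum + 1e-10)

sr = 16000
audio = np.random.randn(sr)  # one second of dummy audio
frames = split_frames(audio, sr)
feats = [feature_vector(f) for f in frames]  # fed to the acoustic model frame by frame
```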
The specific manner in which the various modules perform their operations in the apparatus of the above embodiments has been described in detail in connection with the embodiments of the method and will not be repeated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the device wake-up method provided by the present disclosure.
Fig. 5 is a block diagram illustrating a device wake-up apparatus 500 according to an exemplary embodiment. For example, the apparatus 500 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, or the like.
Referring to fig. 5, an apparatus 500 may include one or more of the following components: a processing component 502, a memory 504, a power component 506, a multimedia component 508, an audio component 510, an input/output (I/O) interface 512, a sensor component 514, and a communication component 516.
The processing component 502 generally controls overall operation of the apparatus 500, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 502 may include one or more processors 520 to execute instructions to perform all or part of the steps of the device wake-up method described above. Further, the processing component 502 can include one or more modules that facilitate interactions between the processing component 502 and other components. For example, the processing component 502 can include a multimedia module to facilitate interaction between the multimedia component 508 and the processing component 502.
The memory 504 is configured to store various types of data to support operations at the apparatus 500. Examples of such data include instructions for any application or method operating on the apparatus 500, contact data, phonebook data, messages, pictures, videos, and the like. The memory 504 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power component 506 provides power to the various components of the device 500. The power components 506 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 500.
The multimedia component 508 includes a screen that provides an output interface between the device 500 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 508 includes a front-facing camera and/or a rear-facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the apparatus 500 is in an operational mode, such as a photographing mode or a video mode. Each front-facing and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
The audio component 510 is configured to output and/or input audio signals. For example, the audio component 510 includes a Microphone (MIC) configured to receive external audio signals when the device 500 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 504 or transmitted via the communication component 516. In some embodiments, the audio component 510 further comprises a speaker for outputting audio signals.
The I/O interface 512 provides an interface between the processing component 502 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: homepage button, volume button, start button, and lock button.
The sensor assembly 514 includes one or more sensors for providing status assessment of various aspects of the apparatus 500. For example, the sensor assembly 514 may detect the on/off state of the device 500, the relative positioning of the components, such as the display and keypad of the device 500, the sensor assembly 514 may also detect a change in position of the device 500 or a component of the device 500, the presence or absence of user contact with the device 500, the orientation or acceleration/deceleration of the device 500, and a change in temperature of the device 500. The sensor assembly 514 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 514 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 514 may also include an acceleration sensor, a gyroscopic sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 516 is configured to facilitate communication between the apparatus 500 and other devices in a wired or wireless manner. The apparatus 500 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 516 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 516 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 500 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for performing the above device wake-up method.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 504, including instructions executable by processor 520 of apparatus 500 to perform the device wake-up method described above. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described device wake-up method when executed by the programmable apparatus.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

1. A method of waking up a device, comprising:
acquiring a plurality of phoneme frames according to the received voice information, wherein the duration of each phoneme frame is a preset duration;
Inputting the plurality of phoneme frames into a preset acoustic model to obtain a first phoneme sequence, wherein the first phoneme sequence comprises a plurality of phonemes corresponding to the plurality of phoneme frames and a judging probability of each phoneme;
and under the condition that invalid phonemes and/or missing phonemes in the first phoneme sequence are determined to meet a preset condition, determining a wake-up result of the voice information according to the first phoneme sequence and the judging probabilities of a plurality of phonemes in the first phoneme sequence.
2. The method of claim 1, wherein inputting the plurality of phoneme frames into a preset acoustic model to obtain a first phoneme sequence, the first phoneme sequence comprising a plurality of phonemes corresponding to the plurality of phoneme frames, and a decision probability for each of the phonemes, comprises:
inputting the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model, wherein the plurality of phoneme sequences are in one-to-one correspondence with the plurality of phoneme frames, and each phoneme sequence comprises at least one phoneme and a judging probability of the at least one phoneme;
for each of the plurality of phoneme frames, determining a phoneme with highest judging probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame;
And taking a plurality of phonemes corresponding to the obtained plurality of phoneme frames as the first phoneme sequence.
3. The method according to claim 2, wherein for each of the plurality of phoneme frames, determining a phoneme with a highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame comprises:
normalizing the judgment probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain the normalized judgment probability of each phoneme;
and determining the phoneme with the highest judgment probability after normalization processing in the phoneme sequence as a phoneme corresponding to the phoneme frame.
4. The method of claim 1, wherein the inactive phones include orphaned phones and extra phones, the method further comprising:
acquiring a second phoneme sequence corresponding to preset standard wake-up voice information;
determining whether the isolated phonemes, the extra phonemes and/or the missing phonemes are present in the first phoneme sequence based on the second phoneme sequence; wherein a phoneme that is unrelated to both its previous adjacent phoneme and its next adjacent phoneme is the isolated phoneme, two or more consecutive phonemes that are each unrelated to the previous adjacent phoneme and the next adjacent phoneme are the extra phonemes, and a phoneme that exists in the second phoneme sequence but is absent from the first phoneme sequence is the missing phoneme;
In the case that the isolated phonemes, the extra phonemes and/or the missing phonemes are determined to exist in the first phoneme sequence, whether the isolated phonemes, the extra phonemes and/or the missing phonemes meet the preset condition is determined.
5. The method of claim 4, wherein the preset conditions include:
the number of orphaned phones is less than a first number threshold, the number of additional phones is less than a second number threshold, and the number of missing phones is less than a third number threshold.
6. The method according to claim 1, wherein the determining the wake-up result of the speech information according to the first phoneme sequence and the decision probabilities of the plurality of phonemes in the first phoneme sequence in the case that it is determined that the invalid phonemes and/or the missing phonemes in the first phoneme sequence meet a preset condition includes:
removing phonemes of which the judging probabilities are smaller than a probability threshold value from a plurality of phonemes in the first phoneme sequence to obtain a third phoneme sequence;
acquiring a sum of judging probabilities of a plurality of phonemes in the third phoneme sequence;
under the condition that the sum of the judging probabilities is smaller than a total probability threshold value, judging that the voice information is awakened to fail;
And determining that the voice information is successfully awakened under the condition that the sum of the judging probabilities is greater than or equal to the total probability threshold value.
7. The method of claim 2, wherein inputting the plurality of phoneme frames into the acoustic model frame-by-frame to obtain a plurality of phoneme sequences output by the acoustic model comprises:
extracting a feature vector of each of the plurality of phoneme frames;
and inputting the feature vectors of the plurality of phoneme frames into the acoustic model frame by frame to obtain the plurality of phoneme sequences output by the acoustic model.
8. A device wake-up apparatus, comprising:
the acquisition module is configured to acquire a plurality of phoneme frames according to the received voice information, wherein the duration of each phoneme frame is a preset duration;
an identification module configured to input the plurality of phoneme frames into an acoustic model to obtain a first phoneme sequence, wherein the first phoneme sequence comprises a plurality of phonemes corresponding to the plurality of phoneme frames and a decision probability of each phoneme;
and the processing module is configured to determine a wake-up result of the voice information according to the first phoneme sequence and the judgment probabilities of a plurality of phonemes in the first phoneme sequence under the condition that the invalid phonemes and/or the missing phonemes in the first phoneme sequence are determined to meet the preset condition.
9. The apparatus of claim 8, wherein the identification module comprises: identifying a sub-module and determining a sub-module;
the recognition sub-module is configured to input the plurality of phoneme frames into the acoustic model frame by frame to obtain a plurality of phoneme sequences output by the acoustic model, wherein the plurality of phoneme sequences are in one-to-one correspondence with the plurality of phoneme frames, and each phoneme sequence comprises at least one phoneme and a judging probability of the at least one phoneme;
the determining submodule is configured to determine, for each of the plurality of phoneme frames, a phoneme with the highest decision probability in the phoneme sequence corresponding to the phoneme frame as a phoneme corresponding to the phoneme frame; and taking a plurality of phonemes corresponding to the obtained plurality of phoneme frames as the first phoneme sequence.
10. The apparatus of claim 9, wherein the determination submodule is configured to:
normalizing the judgment probability of each phoneme in the phoneme sequence corresponding to the phoneme frame to obtain the normalized judgment probability of each phoneme;
and determining the phoneme with the highest judgment probability after normalization processing in the phoneme sequence as a phoneme corresponding to the phoneme frame.
11. The apparatus of claim 8, wherein the invalid phonemes include isolated phonemes and extra phonemes, the apparatus further comprising a preprocessing module configured to:
acquire a second phoneme sequence corresponding to preset standard wake-up voice information;
determine, based on the second phoneme sequence, whether isolated phonemes, extra phonemes and/or missing phonemes are present in the first phoneme sequence; wherein a phoneme that is related to neither its previous adjacent phoneme nor its next adjacent phoneme is an isolated phoneme, a phoneme that is related to its previous adjacent phoneme and/or its next adjacent phoneme but does not belong to the second phoneme sequence is an extra phoneme, and a phoneme that belongs to the second phoneme sequence but does not appear in the first phoneme sequence is a missing phoneme;
and, in the case that it is determined that isolated phonemes, extra phonemes and/or missing phonemes are present in the first phoneme sequence, determine whether the isolated phonemes, extra phonemes and/or missing phonemes satisfy the preset condition.
12. The apparatus of claim 11, wherein the preset condition comprises:
the number of isolated phonemes is less than a first number threshold, the number of extra phonemes is less than a second number threshold, and the number of missing phonemes is less than a third number threshold.
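A sketch of the checks in claims 11 and 12, under a simplified reading in which "related" means adjacency in the reference sequence: "extra" is a decoded phoneme absent from the reference, "missing" is a reference phoneme absent from the decoding, and "isolated" is a decoded phoneme whose neighbours never occur next to it in the reference. The function name, threshold values, and this reading of "related" are all assumptions.

```python
def classify_phonemes(first_seq, second_seq,
                      max_isolated=1, max_extra=1, max_missing=1):
    """Count isolated / extra / missing phonemes in the decoded
    sequence against the standard wake-word sequence, then check
    the (hypothetical) count thresholds of claim 12."""
    ref = set(second_seq)
    pairs = set(zip(second_seq, second_seq[1:]))   # adjacent reference pairs
    extra = sum(1 for p in first_seq if p not in ref)
    missing = sum(1 for p in second_seq if p not in set(first_seq))
    isolated = 0
    for i, p in enumerate(first_seq):
        if p not in ref:
            continue                               # already counted as extra
        prev_ok = i > 0 and (first_seq[i - 1], p) in pairs
        next_ok = i < len(first_seq) - 1 and (p, first_seq[i + 1]) in pairs
        if not prev_ok and not next_ok:
            isolated += 1                          # unrelated to both neighbours
    meets = (isolated < max_isolated and extra < max_extra
             and missing < max_missing)
    return isolated, extra, missing, meets
```

When `meets` is false, the apparatus would skip the probability-based decision and reject the utterance outright.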
13. The apparatus of claim 8, wherein the processing module is configured to:
remove, from the plurality of phonemes in the first phoneme sequence, phonemes whose decision probabilities are smaller than a probability threshold, to obtain a third phoneme sequence;
acquire the sum of the decision probabilities of the plurality of phonemes in the third phoneme sequence;
determine that wake-up by the voice information fails when the sum of the decision probabilities is smaller than a total probability threshold;
and determine that wake-up by the voice information succeeds when the sum of the decision probabilities is greater than or equal to the total probability threshold.
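Claim 13 reduces to a filter-then-sum decision, sketched below. The threshold values are hypothetical; the claim only fixes their roles, not their magnitudes.

```python
def wake_decision(first_sequence, prob_threshold=0.5, total_threshold=2.0):
    """Drop phonemes below the per-phoneme probability threshold to
    form the third phoneme sequence, then compare the sum of the
    remaining decision probabilities against the total threshold.
    Returns True when wake-up succeeds."""
    third = [(p, pr) for p, pr in first_sequence if pr >= prob_threshold]
    total = sum(pr for _, pr in third)
    return total >= total_threshold
```

Summing only the surviving probabilities means a few confidently recognized phonemes can outweigh noisy frames, which is the point of building the third sequence first.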
14. The apparatus of claim 9, wherein the recognition sub-module is configured to:
extract a feature vector of each of the plurality of phoneme frames;
and input the feature vectors of the plurality of phoneme frames into the acoustic model frame by frame to obtain the plurality of phoneme sequences output by the acoustic model.
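The frame-by-frame pipeline of claim 14 can be sketched as follows. The claim does not specify the feature type; the log-energy sub-band features below are a stand-in for whatever acoustic features (e.g. filterbank or MFCC) an implementation would use, and `acoustic_model` is a placeholder callable.

```python
import math

def frame_features(samples, dim=4):
    """Hypothetical per-frame feature vector: log energy over `dim`
    equal sub-bands of the frame's samples."""
    n = max(1, len(samples) // dim)
    feats = []
    for i in range(dim):
        chunk = samples[i * n:(i + 1) * n]
        energy = sum(s * s for s in chunk) + 1e-10   # avoid log(0)
        feats.append(math.log(energy))
    return feats

def stream_frames(frames, acoustic_model):
    """Feed one feature vector to the model per frame; each call
    yields one phoneme sequence, giving the one-to-one mapping
    between phoneme frames and phoneme sequences of claim 9."""
    for frame in frames:
        yield acoustic_model(frame_features(frame))
```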
15. A device wake-up apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: the steps of the method of any one of claims 1-7 are implemented when the executable instructions are executed.
16. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-7.
CN202210174193.7A 2022-02-24 2022-02-24 Equipment wake-up method, device and computer readable storage medium Pending CN116705015A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210174193.7A CN116705015A (en) 2022-02-24 2022-02-24 Equipment wake-up method, device and computer readable storage medium


Publications (1)

Publication Number Publication Date
CN116705015A true CN116705015A (en) 2023-09-05

Family

ID=87829884

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210174193.7A Pending CN116705015A (en) 2022-02-24 2022-02-24 Equipment wake-up method, device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116705015A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination