CN113470646B - Voice awakening method, device and equipment - Google Patents

Info

Publication number
CN113470646B
CN113470646B CN202110745052.1A
Authority
CN
China
Prior art keywords
wake
word
decoding
sliding window
confidence
Prior art date
Legal status
Active
Application number
CN202110745052.1A
Other languages
Chinese (zh)
Other versions
CN113470646A (en)
Inventor
梁镇麟
董林昊
蔡猛
马泽君
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202110745052.1A
Publication of CN113470646A
Application granted
Publication of CN113470646B
Legal status: Active

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/4401 Bootstrapping
    • G06F9/4418 Suspend and resume; Hibernate and awake
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223 Execution procedure of a spoken command

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Error Detection And Correction (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiments of the present application disclose a voice wake-up method, apparatus, and device. For an acquired voice signal to be processed, a text sequence is extracted from the signal. After the text sequence is acquired, a sliding window decoding operation is performed on the decoding graph formed by the text sequence, with the window size set to the length of the wake-up word, and the confidence that the wake-up word appears in each sliding window is determined. At the same time, the search graph formed by the text sequence is decoded using Viterbi decoding to obtain the confidence of the Viterbi path. When the confidence that the wake-up word appears in some sliding window meets a first preset condition, the device is woken up, and both the sliding window operation and the Viterbi decoding operation are terminated. If no sliding window confidence meets the first preset condition, the confidence obtained by Viterbi decoding is used to continue judging whether to wake up. Thus, these operations do not affect the decoding speed of sliding window decoding, while the addition of Viterbi decoding further improves the recall rate.

Description

Voice awakening method, device and equipment
Technical Field
The present application relates to the field of computer processing technologies, and in particular to a voice wake-up method, apparatus, and device.
Background
Voice wake-up is a very important technology in the current speech field. Its main function is to put a device into a working state after receiving a voice command from a user and to execute the operation indicated by that command. The traditional wake-up method decodes and recalls at the level of speech frames; because of the large number of speech frames, decoding is slow. Increasing the decoding speed may in turn reduce the recall rate, which affects device wake-up. How to improve the recall rate without affecting the decoding speed is therefore an urgent problem to be solved.
Disclosure of Invention
In view of this, the embodiments of the present application provide a voice wake-up method, apparatus, and device, so as to improve the recall rate without affecting the decoding speed.
In order to achieve the above object, the technical solution provided by the embodiments of the present application is as follows:
In a first aspect of the embodiments of the present application, a voice wake-up method is provided, where the method includes:
acquiring a voice signal to be processed, and extracting a text sequence from the voice signal to be processed;
performing a sliding window decoding operation on the decoding graph formed by the text sequence according to the length of the wake-up word, and determining the confidence that the wake-up word appears in each sliding window, where the wake-up word is used to wake up the device;
waking up the device when the confidence that the wake-up word appears in a sliding window meets a first preset condition;
and terminating the sliding window decoding operation and a Viterbi decoding operation, where the Viterbi decoding operation is used to decode the text sequence according to the wake-up word.
In a second aspect of the embodiment of the present application, there is provided a voice wake-up device, the device including:
an acquisition unit, configured to acquire a voice signal to be processed and extract a text sequence from the voice signal to be processed;
a determining unit, configured to perform a sliding window decoding operation on the decoding graph formed by the text sequence according to the length of the wake-up word, and to determine the confidence that the wake-up word appears in each sliding window, where the wake-up word is used to wake up the device;
a wake-up unit, configured to wake up the device when the confidence that the wake-up word appears in a sliding window meets a first preset condition;
and a termination unit, configured to terminate the sliding window decoding operation and a Viterbi decoding operation, where the Viterbi decoding operation is used to decode the text sequence according to the wake-up word.
In a third aspect of the embodiment of the present application, there is provided an electronic device, including: a processor and a memory; the memory is used for storing instructions or computer programs; the processor is configured to execute the instructions or the computer program in the memory, so that the electronic device performs the method according to the first aspect.
In a fourth aspect of embodiments of the present application, there is provided a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
Accordingly, the embodiments of the present application have the following beneficial effects:
in the embodiments of the present application, for an acquired voice signal to be processed, a text sequence is extracted from the signal. After the text sequence is acquired, a sliding window decoding operation is performed on the decoding graph formed by the text sequence, with the window size set to the length of the wake-up word, and the confidence that the wake-up word appears in each sliding window is determined. At the same time, the search graph formed by the text sequence is decoded using Viterbi decoding to obtain the confidence of the Viterbi path. When the confidence that the wake-up word appears in some sliding window meets a first preset condition, the device is woken up, and both the sliding window operation and the Viterbi decoding operation are terminated. That is, the embodiments perform sliding window decoding and Viterbi decoding on the extracted text sequence at the same time, and preferentially use the confidence obtained by sliding window decoding to judge whether to wake up the device: if that confidence meets the first preset condition, the device is woken up and both decoding operations are terminated; if it does not, the confidence obtained by Viterbi decoding is used to continue judging whether to wake up. Thus, these operations do not affect the decoding speed of sliding window decoding, while the addition of Viterbi decoding further improves the recall rate.
Drawings
FIG. 1 is a flowchart of a voice wake-up method according to an embodiment of the present application;
FIG. 2 is a flowchart of another voice wake-up method according to an embodiment of the present application;
FIG. 3a is a schematic diagram illustrating a sliding window decoding according to an embodiment of the present application;
FIG. 3b is a schematic view of a sliding window according to an embodiment of the present application;
fig. 4a is a Viterbi decoding schematic diagram according to an embodiment of the present application;
fig. 4b is a schematic diagram of a Viterbi decoding application scenario according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a voice wake-up device according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order that the above-recited objects, features and advantages of the present application may become more readily apparent, a more particular description of embodiments of the application is rendered below with reference to the appended drawings. It is to be understood that the specific embodiments described herein are merely illustrative of the application and do not limit it. In addition, for convenience of description, only the structures related to the present application, rather than all structures, are shown in the drawings.
In order to facilitate understanding of the technical scheme provided by the embodiment of the application, technical terms related to the application will be described first.
Voice wake-up, also known as keyword spotting (KWS), refers to detecting specific wake-up word segments in real time in a continuous speech stream. The purpose of voice wake-up is to switch the device from a sleep state to a working state, so the wake-up word should be detected immediately after it is spoken, giving a better user experience. The wake-up effect is typically evaluated by four indices: wake-up rate, false wake-up rate, response time, and power consumption. The wake-up rate is the success rate of user interaction, technically termed the recall rate. The false wake-up rate is the probability that the device wakes up when the user is not interacting with it. The response time is the time from when the user speaks the wake-up word to when the device gives feedback. The power consumption is the energy used by the wake-up system; many intelligent devices are battery-powered and need long endurance, so they are especially sensitive to power consumption.
Typically, to increase the wake-up rate, a large number of speech frames are decoded to identify whether the speech signal includes the wake-up word; the large number of frames makes decoding slow, i.e., the response time long. Conversely, improving the decoding speed may reduce the wake-up rate, so that the device cannot be woken up normally, affecting the user experience.
Based on this, the embodiments of the present application provide a voice wake-up method that extracts a text sequence from a voice signal to be processed after the signal is obtained. Sliding window decoding is performed on the decoding graph formed by the text sequence according to the length of the wake-up word, and the confidence that the wake-up word appears in each sliding window is determined; the wake-up word is used to wake up the device. That is, the sliding window decoding operation determines the probability that the wake-up word appears within each window. At the same time, the search graph formed by the text sequence is decoded using Viterbi decoding to obtain the confidence of the Viterbi path, i.e., sliding window decoding and Viterbi decoding run simultaneously. When the confidence that the wake-up word appears in some sliding window meets a first preset condition, the device is woken up, and both the sliding window decoding operation and the Viterbi decoding operation are terminated. If none of the confidences obtained by the sliding window decoding operation meets the first preset condition, the Viterbi decoding operation continues to be used to decide whether to wake up the device.
That is, when the device is woken up by the sliding window decoding operation, neither the sliding window decoding operation nor the Viterbi decoding operation is performed any further; if the device is not woken up by the sliding window decoding operation, the Viterbi decoding operation continues to be used to judge whether the predicted text sequence includes the wake-up word, so as to wake up the device and thereby improve the wake-up rate. In addition, since sliding window decoding is faster than Viterbi decoding, sliding window decoding is preferentially used for decoding and wake-up, so the wake-up rate is improved without affecting the decoding speed.
The text sequence may be extracted from the voice signal to be processed using a Continuous Integrate-and-Fire (CIF) model. Specifically, the CIF model integrates the acquired acoustic encoded representations and fires the recognized characters, so that whether to wake up is judged at the character level, reducing decoding overhead and improving decoding speed. CIF is a neuron model that outputs neural spikes: it weights and sums its inputs, integrating the sequentially arriving acoustic information until a certain threshold is reached, and once the integrated amount of information reaches the recognition threshold, it fires the integrated information for subsequent recognition. Applied to an encoder-decoder framework, at each encoding instant CIF receives an acoustic encoded representation together with its corresponding weight (characterizing the amount of information it carries). CIF continuously accumulates the weights and integrates the acoustic encoded representations (as a weighted sum). When the accumulated weight reaches the threshold, an acoustic boundary has been located.
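As a minimal sketch of the integrate-and-fire loop described above (illustrative only: the function name, the uniform threshold of 1.0, and the exact weight-splitting behavior at the boundary are assumptions, not details fixed by the patent):

```python
import numpy as np

def cif_integrate(encodings, weights, threshold=1.0):
    """Accumulate per-frame weights; each time the accumulator reaches
    `threshold`, fire one integrated vector (the weighted sum of the
    frames consumed so far), splitting the boundary frame's weight."""
    fired = []
    acc_w = 0.0
    acc_v = np.zeros_like(encodings[0], dtype=float)
    for h, w in zip(encodings, weights):
        if acc_w + w < threshold:
            acc_w += w
            acc_v = acc_v + w * h
        else:
            used = threshold - acc_w        # the part of w that closes the boundary
            fired.append(acc_v + used * h)  # one character's integrated representation
            acc_w = w - used                # the remainder starts the next integration
            acc_v = acc_w * h
    return fired
```

With, say, 8 frames of weight 0.25 each, the accumulator crosses the threshold twice, so two integrated vectors (two located acoustic boundaries) are fired.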
In order to facilitate understanding of the technical scheme provided by the application, the voice wake-up method provided by the embodiment of the application will be described below with reference to the accompanying drawings.
Referring to fig. 1, which is a flowchart of a voice wake-up method provided by an embodiment of the present application; as shown in fig. 1, the method may include:
s101: and acquiring a voice signal to be processed, and extracting a text sequence from the voice signal to be processed.
In this embodiment, a device with voice wake-up capability may monitor in real time whether the voice signal sent by the user (the voice signal to be processed) includes a specific wake-up word; when the user speaks the wake-up word, the device is woken up and switches to the working state to wait for the user's next instruction. The voice signal to be processed is thus the voice signal uttered by the user. While acquiring the voice signal to be processed, the included text sequence is extracted from it using natural language processing technology.
S102: and carrying out sliding window decoding operation on a decoding diagram formed by the character sequences according to the length of the wake-up words, and determining the confidence coefficient of the wake-up words in each sliding window.
After the text sequence included in the voice signal to be processed is obtained, whether to wake the device is determined according to the text sequence and the wake-up word. The wake-up word is used to wake up the device, and its length can be set according to the actual situation. Specifically, after the text sequence is acquired, a decoder is used to decode it into a decoding graph whose length is the length of the text sequence. That is, the length of the decoding graph in this embodiment is the length of the text, so the decoding path is shorter than when wake-up is performed based on speech frames, thereby improving the decoding speed.
While the decoding operation is performed on the decoding graph formed by the text sequence, Viterbi decoding is performed on the search graph formed by the text sequence to obtain the confidence on the Viterbi path. That is, two different decoding operations are performed on the text sequence simultaneously.
S103: and waking up the equipment when the confidence coefficient of the wake-up word appearing in the sliding window meets a first preset condition.
In this embodiment, each sliding window operation yields a confidence that the wake-up word appears in that window, and whether the confidence meets a first preset condition is judged; if so, the device is woken up and S104 is performed to terminate the Viterbi decoding operation. The first preset condition may be configured according to the actual application. For example, if the first preset condition is that the confidence of the wake-up word is greater than or equal to a first preset confidence threshold, then waking up the device when the confidence that the wake-up word appears in the i-th sliding window meets the first preset condition includes: waking up the device when the confidence that the wake-up word appears in the i-th sliding window is greater than or equal to the first preset confidence threshold, where i is a positive integer between 1 and N, and N is the number of sliding operations.
If the confidence of the wake-up word in none of the sliding window operations meets the first preset condition, the Viterbi decoding operation on the search graph formed by the text sequence continues, yielding the confidence on the Viterbi path. If the confidence on the Viterbi path meets a second preset condition, the device is woken up. The second preset condition may likewise be configured according to the actual application; for example, it may be that the confidence on the Viterbi path is greater than or equal to a second preset confidence threshold, where the second preset confidence threshold is less than the first preset confidence threshold.
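The two-stage decision described above can be sketched as follows (a minimal illustration; the threshold values t1 and t2 are assumed, with the second threshold lower than the first as the text requires):

```python
def should_wake(window_confidences, viterbi_confidence, t1=0.6, t2=0.4):
    """Stage 1: wake immediately if any sliding-window confidence meets the
    first preset condition (>= t1); this is also the point where both
    decoding operations would be terminated. Stage 2: otherwise fall back
    to the Viterbi-path confidence against the lower threshold t2 < t1."""
    if any(c >= t1 for c in window_confidences):
        return True
    return viterbi_confidence >= t2
```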
S104: the sliding window decoding operation and the viterbi decoding operation are terminated.
In this embodiment, when the confidence that the wake-up word appears in some sliding window operation meets the first preset condition, the device is woken up, and the sliding window decoding operation and the Viterbi decoding operation are terminated.
It can be seen that, according to the embodiments of the present application, for an acquired voice signal to be processed, a text sequence is extracted from the signal. After the text sequence is acquired, a sliding window decoding operation is performed on the decoding graph formed by the text sequence, with the window size set to the length of the wake-up word, and the confidence that the wake-up word appears in each sliding window is determined. At the same time, the search graph formed by the text sequence is decoded using Viterbi decoding to obtain the confidence of the Viterbi path. When the confidence that the wake-up word appears in some sliding window meets the first preset condition, the device is woken up, and both the sliding window operation and the Viterbi decoding operation are terminated. That is, sliding window decoding and Viterbi decoding are performed on the extracted text sequence at the same time, and the confidence obtained by sliding window decoding is preferentially used to judge whether to wake up the device: if it meets the first preset condition, the device is woken up and both decoding operations are terminated; otherwise, the confidence obtained by Viterbi decoding is used to continue judging whether to wake up. Thus, these operations do not affect the decoding speed of sliding window decoding, while the addition of Viterbi decoding further improves the recall rate.
Referring to fig. 2, the flowchart of another voice wake-up method provided by the embodiment of the present application, as shown in fig. 2, the method may include:
s201: and acquiring a voice signal to be processed, and extracting a text sequence from the voice signal to be processed.
In this embodiment, a device with voice wake-up capability is in a sleep state after it is turned on and automatically loads its resources. In the sleep state, the device monitors in real time whether the voice signal sent by the user (the voice signal to be processed) includes a specific wake-up word; when the user speaks the wake-up word, the device is woken up and switches to the working state to wait for the user's next instruction. The voice signal to be processed is the voice signal uttered by the user.
The text sequence can be extracted from the voice signal to be processed in the following way:
1) Acquire the speech features to be processed from the voice signal to be processed, and encode them to obtain acoustic encoded representations.
After the device collects the voice signal to be processed, the speech features to be processed are obtained from it and encoded into acoustic encoded representations. Specifically, since a speech signal is quasi-stationary, it is first framed during processing, with each frame about 20 ms to 30 ms long; within such an interval the signal can be regarded as stationary, and only stationary information can be signal-processed. After framing, wavelet transformation and further processing are applied to each frame to obtain the speech feature corresponding to each speech frame. After the speech feature of each frame is obtained, it is encoded into an acoustic encoded representation.
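Framing into 20-30 ms quasi-stationary windows can be sketched as below (illustrative; the 16 kHz sample rate, 25 ms frame length, and 10 ms hop are assumed values, not taken from the patent):

```python
def frame_signal(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a 1-D sample sequence into overlapping frames of about
    20-30 ms, within which speech is treated as quasi-stationary."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)          # 160 samples at 16 kHz
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]
```

Each returned frame would then go through the feature extraction and encoding steps described above.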
2) Integrate each acoustic encoded representation with its corresponding weight to obtain the text sequence.
After the acoustic encoded representation corresponding to each speech feature is obtained, the text sequence is derived from each acoustic encoded representation and its corresponding weight. The integration can be implemented with the CIF model, which fires the text sequence included in the voice signal to be processed. Specifically, the acoustic encoded representations are integrated with their corresponding weights to obtain a target acoustic encoded representation; when the weights corresponding to the acoustic encoded representations meet a preset condition, the text sequence included in the voice signal to be processed is obtained from the target acoustic encoded representation. The weights meet the preset condition when the accumulated weight equals a preset threshold, which can be set according to the actual application.
For example, suppose the acquired voice signal to be processed includes 100 frames. The speech features corresponding to the 100 frames are extracted and encoded, producing the acoustic encoded representations corresponding to the 100 frames, which are input into a CIF model. The CIF model accumulates the weights and integrates the acoustic encoded representations (as a weighted sum); each time the accumulated weight reaches the threshold, a character is fired, so the CIF model outputs, say, the 10 characters included in the voice signal to be processed.
S202: and carrying out sliding window decoding operation on a decoding diagram formed by the character sequences according to the length of the wake-up words, and determining the confidence coefficient of the wake-up words in each sliding window.
After the text sequence included in the voice signal to be processed is obtained, whether to wake the device is determined according to the text sequence and the wake-up word. The wake-up word is used to wake up the device, and its length can be set according to the actual situation.
Specifically, after the text sequence is acquired, a decoder is used to decode it into a decoding graph whose length is the length of the text sequence. That is, the length of the decoding graph in this embodiment is the length of the text, so the decoding path is shorter than when wake-up is performed based on speech frames, thereby improving the decoding speed. To reduce the number of parameters and the amount of computation and improve the computation speed, the decoder may be a non-autoregressive decoder. The decoding graph is an M×K matrix, where M is the length of the text sequence and K is the size of the vocabulary of common words. For example, in the decoding graph shown in fig. 3a, the vocabulary includes 20 common words and the text sequence length is 10; each entry of the matrix is the posterior probability that a predicted character is a certain common word. Here w1-w20 are the common words and q1-q10 are the 10 predicted characters; each row gives the probabilities that the corresponding predicted character is each vocabulary word, and the posterior probabilities in each row sum to 1.
After the decoding graph corresponding to the text sequence is obtained, a window whose size is the length of the wake-up word slides over the decoding graph, and the confidence that the wake-up word appears in the window at each sliding position is determined. Specifically, for any sliding window operation, the posterior probability corresponding to each character in the window is obtained, and these posterior probabilities are multiplied together to give the confidence that the wake-up word appears in that window. For example, as shown in fig. 3b, with a wake-up word of length 4, each sliding window covers 4 predicted characters, and the posterior probability that these 4 characters form the wake-up word is determined. In the 1st sliding window, if the probability that q1 is the first character of the wake-up word is p15, the probability that q2 is the second character is p22, the probability that q3 is the third character is p37, and the probability that q4 is the fourth character is p48, then the confidence corresponding to the 1st sliding window is p15×p22×p37×p48. Similarly, if the probability that q2 is the first character is p25, that q3 is the second character is p32, that q4 is the third character is p47, and that q5 is the fourth character is p58, then the confidence corresponding to the 2nd sliding window is p25×p32×p47×p58. Sliding in turn yields the confidence of each window.
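The window-by-window product just described can be sketched as follows (illustrative; the function name and the toy matrix in the example are assumptions). Row m of `posteriors` corresponds to predicted character q(m+1) in fig. 3a, and `wake_word_ids` holds the vocabulary indices of the wake-up word's characters, so the window size equals the wake-up word length:

```python
import numpy as np

def window_confidences(posteriors, wake_word_ids):
    """posteriors: (M, K) decoding graph; each row is a posterior over the
    K-word vocabulary and sums to 1. Slides a window of size
    len(wake_word_ids) down the rows and returns, per position, the
    product of the posteriors that each in-window character equals the
    corresponding wake-up-word character."""
    length = len(wake_word_ids)
    M = posteriors.shape[0]
    confs = []
    for start in range(M - length + 1):
        c = 1.0
        for offset, wid in enumerate(wake_word_ids):
            c *= posteriors[start + offset, wid]
        confs.append(c)
    return confs
```

The first preset condition would then be checked against each entry of the returned list as it is produced.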
S203: and carrying out Viterbi decoding on the search graph formed by the text sequence to obtain the confidence level on the Viterbi path.
In this embodiment, the decoding operation on the decoding graph formed by the text sequence (sliding window decoding) and the decoding operation on the search graph formed by the text sequence (Viterbi decoding) are performed in parallel. The Viterbi decoding operation performs Viterbi decoding on the search graph to obtain the confidence on the Viterbi path. The search graph is an N×M matrix, where N is the length of the wake-up word plus 1, i.e., the total number of characters in the wake-up word plus 1 (for example, if the wake-up word has 4 characters, N is 5), and M is the length of the text sequence, i.e., the total number of characters it includes (for example, if 7 characters are extracted from the voice signal to be processed, M is 7). The first row of the search graph corresponds to non-wake-up characters, and the second through N-th rows each correspond to one character of the wake-up word.
As shown in fig. 4a, the wake-up word includes 4 characters, "XYXY", and the text sequence includes 7 characters. The first row of the search graph is the "other" node, the second row is X, the third row is Y, the fourth row is X, and the fifth row is Y. The "other" node represents characters other than those of the wake-up word. In general, the "other" node takes the highest probability among the non-wake-up-word characters predicted at the current node; therefore, if a character of the wake-up word is actually predicted, the probability of "other" is very low, and if no character of the wake-up word is predicted, the probability of "other" is very high. Thus, the presence of the "other" node does not affect the confidence of the Viterbi path.
In decoding on the search graph, only downward or rightward moves are allowed. Downward decoding serves to accurately identify the wake-up word included in the text sequence. Rightward decoding solves the following problems. First, if the user draws out a sound, a case such as "XYYXY" may occur; rightward decoding merges the two adjacent Y characters so that the probability of "Y" is not affected. Second, if the user misspeaks, for example "XYZXY", the probability of decoding "Z" as Y is small, so the confidence of the Viterbi path is small, and the threshold comparison determines that the device should not be woken.
Specifically, when Viterbi decoding is performed on the search graph formed by the text sequence to obtain the confidence on the Viterbi path, Viterbi decoding may proceed rightward or downward from the first character of the text sequence to obtain the Viterbi path; the posterior probabilities corresponding to the characters on the Viterbi path are then obtained and multiplied to obtain the confidence of the Viterbi path. As shown in fig. 4b, taking the acquired text sequence "open XYXY" as an example, the Viterbi path is a1-a2-a3-a4-a5, and the two characters of "open" have high probabilities as "other", namely p1 and p2; if the probability that the third character is X is p3, the probability that the fourth character is Y is p4, the probability that the fifth character is X is p5, and the probability that the sixth character is Y is p6, the confidence of the Viterbi path is P = p1×p2×p3×p4×p5×p6.
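The downward/rightward decoding on the search graph can be sketched as a small max-product dynamic program. This is a minimal sketch under assumptions not stated in the patent: `emis[r][t]` holds the posterior that text character `t` matches row `r` (row 0 the "other" node, rows 1..L the wake-word characters), the path may start on row 0 or row 1, and it must end on the last wake-word row; the toy numbers are invented.

```python
def viterbi_confidence(emis):
    """Max-product Viterbi over an (L+1) x M search grid where a step may
    stay on the same row (rightward) or advance one row (downward)."""
    n_rows, m = len(emis), len(emis[0])
    NEG = float("-inf")
    score = [NEG] * n_rows
    score[0] = emis[0][0]  # start in "other" ...
    score[1] = emis[1][0]  # ... or on the first wake-word character
    for t in range(1, m):
        new = [NEG] * n_rows
        for r in range(n_rows):
            best = score[r]                     # rightward: stay on row r
            if r >= 1 and score[r - 1] > best:  # downward: advance from r-1
                best = score[r - 1]
            if best != NEG:
                new[r] = best * emis[r][t]
        score = new
    return score[-1]  # confidence of paths ending on the last wake-word row

# Toy grid: wake word of 2 characters, text of 3 characters ("other" + both).
emis = [[0.90, 0.1, 0.1],   # row 0: "other"
        [0.05, 0.8, 0.1],   # row 1: first wake-word character
        [0.05, 0.1, 0.8]]   # row 2: second wake-word character
print(viterbi_confidence(emis))
```

The best path here is other → first character → second character, so the returned confidence is the product 0.9×0.8×0.8, matching the p1×p2×…×p6 product form in the description.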
S204: and waking up the equipment when the confidence coefficient of the wake-up word appearing in the ith sliding window is greater than or equal to a first preset confidence coefficient threshold value.
The two decoding operations are executed in parallel without affecting each other. Since sliding-window decoding is much faster than Viterbi decoding, the confidence obtained by the sliding-window decoding operation is used first to judge whether to wake the device. If the confidence that the wake-up word appears in a sliding window is greater than or equal to the first preset confidence threshold, indicating that the speech signal to be processed includes the wake-up word, the device is woken; at the same time S205 is executed, so that no further decoding is performed and the amount of computation is reduced. Here i is greater than or equal to 1 and less than or equal to N, N is the total number of window slides, N = M - L + 1, and L is the length of the wake-up word. For example, if the length of the text sequence is 10 and the length of the wake-up word is 4, the number of slides is 7. If, say, the confidence of the 3rd sliding window is greater than or equal to the first preset confidence threshold, the device is woken; otherwise sliding continues until the window has slid N times.
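The early-stop behavior above can be sketched as a short loop over per-window confidences (the N = M - L + 1 count is from the description; the confidence values below are invented toy numbers):

```python
def wake_with_early_stop(window_confs, threshold):
    """Check windows in order; as soon as window i reaches the threshold,
    wake and stop (at which point both decoding operations would be
    terminated, per step S205). Returns (woken, last window checked)."""
    for i, conf in enumerate(window_confs, start=1):
        if conf >= threshold:
            return True, i
    return False, len(window_confs)

# With a text sequence of length M = 10 and wake-word length L = 4,
# there are N = M - L + 1 = 7 window slides in total.
M, L = 10, 4
n_windows = M - L + 1
```

Stopping at the first window that clears the threshold is what avoids the remaining window products and the rest of the Viterbi pass.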
S205: the sliding window decoding operation and the viterbi decoding operation are terminated.
S206: and waking up the equipment when the confidence coefficient of the wake-up word appearing in the N sliding windows is smaller than a first preset confidence coefficient threshold value and the confidence coefficient on the Viterbi path is larger than or equal to a second preset confidence coefficient threshold value.
In this embodiment, since sliding-window decoding is faster than Viterbi decoding, if the device has not been woken after sliding-window decoding is completed, the Viterbi decoding operation continues, and after it completes, the confidence on the Viterbi path is obtained. When the confidence on the Viterbi path is greater than or equal to the second preset confidence threshold, the device is woken. The second preset confidence threshold is less than the first preset confidence threshold; that is, the Viterbi decoding operation can identify low-confidence positive samples that were excluded by the sliding-window decoding operation.
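The two-stage decision can be summarized in a few lines. This is a sketch only; the Viterbi confidence is passed as a callable so that, as in the description, it is consumed only when the sliding-window stage fails (the threshold values in the test are invented):

```python
def wake_decision(window_confs, viterbi_conf_fn, thr_high, thr_low):
    """Stage 1: compare sliding-window confidences against the higher
    (first preset) threshold. Stage 2, only if stage 1 fails: compare the
    Viterbi-path confidence against the lower (second preset) threshold,
    recovering low-confidence positive samples."""
    if any(c >= thr_high for c in window_confs):
        return True                      # woken by sliding-window decoding
    return viterbi_conf_fn() >= thr_low  # fall back to Viterbi decoding
```

Because `thr_low < thr_high`, a sample rejected by every sliding window can still wake the device through the Viterbi path.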
Therefore, with the decoding method provided by this embodiment of the application, when the device cannot be woken through the sliding-window decoding operation, the Viterbi decoding operation can further perform the wake-up judgment, so that the decoding speed for high-confidence samples is not affected, while the wake-up rate for low-confidence positive samples (samples not identified by the sliding-window decoding operation) is improved.
Based on the above method embodiments, the embodiments of the present application provide a voice wake-up device, which will be described with reference to the accompanying drawings.
Referring to fig. 5, the structure diagram of a voice wake-up device provided by the embodiment of the present application, as shown in fig. 5, the device may include:
An obtaining unit 501, configured to obtain a to-be-processed voice signal, and extract a text sequence from the to-be-processed voice signal;
a determining unit 502, configured to perform sliding windows on a decoding diagram formed by the text sequence according to a length of a wake-up word, determine a confidence level of the wake-up word occurring in each sliding window, where the wake-up word is used for waking up a device;
a wake-up unit 503, configured to wake up the device when the confidence level of the wake-up word appearing in the sliding window meets a first preset condition;
and a termination unit 504, configured to terminate a sliding window operation and a viterbi decoding operation, where the viterbi decoding operation is used to decode the text sequence according to the wake-up word.
In a possible implementation manner, the wake-up unit 503 is specifically configured to wake up the device when the confidence level of the wake-up word appearing in the ith sliding window is greater than or equal to a first preset confidence threshold, where i is a positive integer greater than or equal to 1 and less than or equal to N, and N is the number of sliding times.
In a possible implementation manner, the obtaining unit 501 is further configured to, when the confidence coefficient of the wake-up word appearing in the N sliding windows does not meet the first preset condition, continue to perform viterbi decoding on the search graph formed by the text sequence, to obtain the confidence coefficient on the viterbi path, where N is the number of sliding times;
The wake-up unit 503 is further configured to wake up the device when the confidence level on the viterbi path meets a second preset condition.
In one possible implementation manner, the first preset condition is that the confidence level of the wake-up word is greater than or equal to a first preset confidence level threshold; the second preset condition is that the confidence degree on the Viterbi path is larger than or equal to a second preset confidence degree threshold value, and the second preset confidence degree threshold value is smaller than the first preset confidence degree threshold value.
In a possible implementation manner, the obtaining unit 501 is specifically configured to perform viterbi decoding from a first word in the word sequence to the right or downward to obtain a viterbi path; and obtaining posterior probabilities corresponding to all characters on the Viterbi path, and multiplying the posterior probabilities corresponding to all the characters to obtain the confidence of the Viterbi path.
In one possible implementation, the search graph is a matrix of n×m, where N is a total number of words included in the wake-up word plus 1, and M is a total number of words included in the word sequence.
In one possible implementation manner, the first line of the search graph corresponds to a non-wake word, and the second line to the nth line of the search graph respectively correspond to one word of the wake words.
In one possible implementation, the decoding graph is a matrix of m×k, where K is a length of a vocabulary.
In a possible implementation manner, the determining unit 502 is specifically configured to obtain, for any sliding window operation, posterior probabilities corresponding to each word in a sliding window, and multiply the posterior probabilities corresponding to each word to obtain a confidence level of occurrence of the wake-up word in each sliding window.
In a possible implementation manner, the obtaining unit 501 is specifically configured to obtain a to-be-processed voice feature from the to-be-processed voice signal, and encode the to-be-processed voice feature to obtain an acoustic encoded representation; and integrating according to each acoustic coding representation and the corresponding weight of the acoustic coding representation to obtain a text sequence.
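The "weighted integration" of acoustic encoded representations can be sketched as follows. The patent does not specify the mechanism, so this sketch assumes a continuous integrate-and-fire style scheme (an assumption, not the patent's stated method): per-frame weights are accumulated, and each time the accumulator reaches a threshold, the weighted sum of the frames consumed so far is emitted as one token representation.

```python
def integrate_and_fire(enc, weights, threshold=1.0):
    """ASSUMED integration scheme: enc is a list of frame vectors, weights
    the per-frame weights. Accumulate weight; when it crosses the threshold,
    fire one integrated token vector and carry the remainder forward."""
    dim = len(enc[0])
    tokens, acc, acm = [], 0.0, [0.0] * dim
    for h, w in zip(enc, weights):
        if acc + w < threshold:
            acc += w
            acm = [a + w * x for a, x in zip(acm, h)]
        else:
            part = threshold - acc  # weight that completes this token
            tokens.append([a + part * x for a, x in zip(acm, h)])
            acc = w - part          # remainder starts the next token
            acm = [acc * x for x in h]
    return tokens
```

With four unit frames of weight 0.5 each, this fires twice, i.e., two frames are merged into each emitted token representation.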
It should be noted that, in this embodiment, the implementation of each unit may be described in the method embodiment described in fig. 1 or fig. 2, and this embodiment is not repeated here.
Referring now to fig. 6, a schematic diagram of an electronic device 1300 suitable for implementing embodiments of the present application is shown. The terminal device in the embodiments of the present application may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable media players), and vehicle-mounted terminals (e.g., car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in fig. 6 is only an example and should not be construed as limiting the functionality and scope of use of the embodiments of the application.
As shown in fig. 6, the electronic device 1300 may include a processing means (e.g., a central processing unit, a graphics processing unit, etc.) 1301, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage means 1306 into a random access memory (RAM) 1303. The RAM 1303 also stores various programs and data necessary for the operation of the electronic device 1300. The processing means 1301, the ROM 1302, and the RAM 1303 are connected to each other through a bus 1304. An input/output (I/O) interface 1305 is also connected to the bus 1304.
In general, the following devices may be connected to the I/O interface 1305: input devices 1306 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, and the like; an output device 1307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 1306 including, for example, magnetic tape, hard disk, etc.; and communication means 1309. The communication means 1309 may allow the electronic device 1300 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 shows an electronic device 1300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead.
In particular, according to embodiments of the present application, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a non-transitory computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 1309, or installed from the storage device 1306, or installed from the ROM 1302. When being executed by the processing means 1301, performs the above-described functions defined in the method of an embodiment of the present application.
The electronic device provided by the embodiment of the present application belongs to the same inventive concept as the voice wake-up method provided by the above embodiments; technical details not described in detail in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
An embodiment of the present application provides a computer readable medium having a computer program stored thereon, wherein the program when executed by a processor implements a voice wake-up method as described in any of the above embodiments.
The computer readable medium of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
In some implementations, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
The computer readable medium may be contained in the electronic device; or may exist alone without being incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the voice wake-up method described above.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present application may be implemented in software or in hardware. The name of the unit/module is not limited to the unit itself in some cases, and, for example, the voice data acquisition module may also be described as a "data acquisition module".
The functions described above herein may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: a Field Programmable Gate Array (FPGA), an Application Specific Integrated Circuit (ASIC), an Application Specific Standard Product (ASSP), a system on a chip (SOC), a Complex Programmable Logic Device (CPLD), and the like.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, a voice wake-up method is provided, a voice signal to be processed is obtained, and a text sequence is extracted from the voice signal to be processed;
performing sliding window decoding operation on a decoding diagram formed by the text sequence according to the length of the wake-up word, and determining the confidence level of the wake-up word in each sliding window, wherein the wake-up word is used for waking up equipment;
when the confidence coefficient of the wake-up word appearing in the sliding window meets a first preset condition, waking up the equipment;
and terminating a sliding window decoding operation and a Viterbi decoding operation, wherein the Viterbi decoding operation is used for decoding the text sequence according to the wake-up word.
According to one or more embodiments of the present disclosure, when the confidence level of the wake-up word appearing in the sliding window meets a first preset condition, waking up the device includes:
and waking up the equipment when the confidence that the wake-up word appears in the ith sliding window is greater than or equal to a first preset confidence threshold, wherein i is a positive integer greater than or equal to 1 and less than or equal to N, and N is the number of slides.
According to one or more embodiments of the present disclosure, the method further comprises:
When the confidence that the wake-up word appears in the N sliding windows does not meet the first preset condition, continuing to perform Viterbi decoding on a search graph formed by the text sequence to obtain the confidence on a Viterbi path, wherein N is the number of slides;
and waking up the equipment when the confidence degree on the Viterbi path meets a second preset condition.
According to one or more embodiments of the present disclosure, the first preset condition is that the confidence level of the wake word is greater than or equal to a first preset confidence threshold; the second preset condition is that the confidence degree on the Viterbi path is larger than or equal to a second preset confidence degree threshold value, and the second preset confidence degree threshold value is smaller than the first preset confidence degree threshold value.
According to one or more embodiments of the present disclosure, the viterbi decoding the search graph formed by the text sequence to obtain the confidence level on the viterbi path includes:
performing Viterbi decoding right or downward from the first word in the word sequence to obtain a Viterbi path;
and obtaining posterior probabilities corresponding to all characters on the Viterbi path, and multiplying the posterior probabilities corresponding to all the characters to obtain the confidence of the Viterbi path.
According to one or more embodiments of the present disclosure, the search graph is a matrix of n×m, where N is a total number of words included in the wake-up word plus 1, and M is a total number of words included in the word sequence.
According to one or more embodiments of the present disclosure, the first line of the search graph corresponds to a non-wake word, and the second line to the nth line of the search graph respectively correspond to one word of the wake words.
According to one or more embodiments of the present disclosure, the decoding graph is a matrix of m×k, where K is a length of a vocabulary.
According to one or more embodiments of the present disclosure, performing sliding windows on the decoding graph formed by the text sequence according to the length of the wake-up word, and determining the confidence that the wake-up word appears in each sliding window, includes:

for any sliding window operation, acquiring the posterior probability corresponding to each character in the sliding window, and multiplying the posterior probabilities of the characters to obtain the confidence that the wake-up word appears in that sliding window.
According to one or more embodiments of the present disclosure, the extracting the text sequence from the speech signal to be processed includes:
acquiring a voice feature to be processed from the voice signal to be processed, and encoding the voice feature to be processed to obtain an acoustic encoding representation;
And integrating according to each acoustic coding representation and the corresponding weight of the acoustic coding representation to obtain a text sequence.
According to one or more embodiments of the present disclosure, there is provided a voice wake apparatus, the apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a voice signal to be processed and extracting a text sequence from the voice signal to be processed;
the determining unit is used for sliding windows on the decoding graph formed by the text sequence according to the length of the wake-up word, and determining the confidence that the wake-up word appears in each sliding window, wherein the wake-up word is used for waking up equipment;
the wake-up unit is used for waking up the equipment when the confidence coefficient of the wake-up word appearing in the sliding window meets a first preset condition;
and the termination unit is used for terminating the sliding window operation and the Viterbi decoding operation, and the Viterbi decoding operation is used for decoding the text sequence according to the wake-up word.
According to one or more embodiments of the present disclosure, the wake-up unit is specifically configured to wake up the device when the confidence level of the wake-up word appearing in the ith sliding window is greater than or equal to a first preset confidence threshold, where i is a positive integer greater than or equal to 1 and less than or equal to N, and N is the number of sliding times.
According to one or more embodiments of the present disclosure, the obtaining unit is further configured to continue to perform Viterbi decoding on a search graph formed by the text sequence when the confidence that the wake-up word appears in the N sliding windows does not meet the first preset condition, to obtain the confidence on the Viterbi path, where N is the number of slides;
the wake-up unit is further configured to wake up the device when the confidence level on the viterbi path meets a second preset condition.
According to one or more embodiments of the present disclosure, the first preset condition is that the confidence level of the wake word is greater than or equal to a first preset confidence threshold; the second preset condition is that the confidence degree on the Viterbi path is larger than or equal to a second preset confidence degree threshold value, and the second preset confidence degree threshold value is smaller than the first preset confidence degree threshold value.
According to one or more embodiments of the present disclosure, the obtaining unit is specifically configured to perform viterbi decoding from a first word in the word sequence to the right or downward to obtain a viterbi path; and obtaining posterior probabilities corresponding to all characters on the Viterbi path, and multiplying the posterior probabilities corresponding to all the characters to obtain the confidence of the Viterbi path.
According to one or more embodiments of the present disclosure, the search graph is a matrix of n×m, where N is a total number of words included in the wake-up word plus 1, and M is a total number of words included in the word sequence.
According to one or more embodiments of the present disclosure, the first line of the search graph corresponds to a non-wake word, and the second line to the nth line of the search graph respectively correspond to one word of the wake words.
According to one or more embodiments of the present disclosure, the decoding graph is a matrix of m×k, where K is a length of a vocabulary.
According to one or more embodiments of the present disclosure, the determining unit is specifically configured to obtain, for any sliding window operation, posterior probabilities corresponding to each word in a sliding window, and multiply the posterior probabilities corresponding to each word to obtain a confidence level of occurrence of the wake-up word in each sliding window.
According to one or more embodiments of the present disclosure, the obtaining unit is specifically configured to obtain a to-be-processed speech feature from the to-be-processed speech signal, and encode the to-be-processed speech feature to obtain an acoustic encoded representation; and integrating according to each acoustic coding representation and the corresponding weight of the acoustic coding representation to obtain a text sequence.
It should be noted that, in the present description, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different manner from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other. For the system or device disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, the description is relatively simple, and the relevant points refer to the description of the method section.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (12)

1. A voice wake-up method, the method comprising:
acquiring a voice signal to be processed, and extracting a text sequence from the voice signal to be processed;
performing a sliding window decoding operation on a decoding graph formed by the text sequence according to the length of a wake-up word, and determining the confidence of the wake-up word appearing in each sliding window, wherein the wake-up word is used for waking up a device;
waking up the device when the confidence of the wake-up word appearing in a sliding window meets a first preset condition; when the confidence of the wake-up word in all N sliding windows does not meet the first preset condition, further performing Viterbi decoding on a search graph formed by the text sequence to obtain a confidence on the Viterbi path, and waking up the device when the confidence on the Viterbi path meets a second preset condition, wherein N is the number of sliding operations;
and terminating the sliding window decoding operation and the Viterbi decoding operation, wherein the Viterbi decoding operation is used for decoding the text sequence according to the wake-up word.
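The two-stage decision of claim 1 can be sketched in Python as follows. The names `should_wake`, `window_confs`, `viterbi_conf_fn`, `thr1` and `thr2` are hypothetical; the sketch only illustrates the control flow (fast sliding-window check first, Viterbi fallback second), not the patented implementation.

```python
def should_wake(window_confs, viterbi_conf_fn, thr1, thr2):
    """Stage 1: if any sliding-window confidence reaches thr1, wake up.
    Stage 2: if all N windows fail, fall back to the Viterbi-path
    confidence, compared against the lower threshold thr2 (claim 3
    requires thr2 < thr1)."""
    for conf in window_confs:        # one confidence per window position
        if conf >= thr1:
            return True              # fast path: wake up immediately
    # all N windows failed the first condition: run Viterbi decoding
    return viterbi_conf_fn() >= thr2
```

Passing the Viterbi confidence as a callable keeps the more expensive Viterbi decoding lazy, mirroring the claim's ordering in which it runs only after all N windows fail the first condition.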
2. The method of claim 1, wherein waking up the device when the confidence of the wake-up word appearing in the sliding window meets the first preset condition comprises:
waking up the device when the confidence of the wake-up word appearing in the ith sliding window is greater than or equal to a first preset confidence threshold, wherein i is a positive integer greater than or equal to 1 and less than or equal to N, and N is the number of sliding operations.
3. The method of claim 1, wherein the first preset condition is that the confidence of the wake-up word is greater than or equal to a first preset confidence threshold, and the second preset condition is that the confidence on the Viterbi path is greater than or equal to a second preset confidence threshold, the second preset confidence threshold being smaller than the first preset confidence threshold.
4. The method of claim 1, wherein performing Viterbi decoding on the search graph formed by the text sequence to obtain the confidence on the Viterbi path comprises:
performing Viterbi decoding rightward or downward from the first word of the text sequence to obtain the Viterbi path;
and acquiring the posterior probability corresponding to each word on the Viterbi path, and multiplying the posterior probabilities corresponding to the words to obtain the confidence of the Viterbi path.
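One plausible reading of claims 4-6 is a trellis whose rows are the filler (non-wake) state plus the wake-word characters and whose columns are the text positions, where "rightward" stays in the same row and "downward" advances to the next wake-word character while consuming one text position. The confidence is then the product of posteriors along the best path. The sketch below assumes that reading and uses log probabilities for numerical stability; it is an illustration, not the patent's decoder.

```python
import numpy as np

def viterbi_confidence(logp):
    """logp[r][c]: log posterior of row r's token at text position c.
    Row 0 is the filler state; rows 1..n-1 are wake-word characters.
    Allowed moves: right (same row) or down (advance one row while
    moving to the next text position)."""
    n, m = logp.shape
    best = np.full((n, m), -np.inf)   # best log score reaching each cell
    best[0, 0] = logp[0, 0]
    if n > 1:
        best[1, 0] = logp[1, 0]       # may also start on the first wake character
    for c in range(1, m):
        for r in range(n):
            stay = best[r, c - 1]                            # rightward
            down = best[r - 1, c - 1] if r > 0 else -np.inf  # downward
            best[r, c] = max(stay, down) + logp[r, c]
    # path confidence = product of posteriors = exp of the best log score
    return float(np.exp(best[n - 1, m - 1]))
```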
5. The method of claim 1, wherein the search graph is an N*M matrix, N being the total number of words included in the wake-up word plus 1, and M being the total number of words included in the text sequence.
6. The method of claim 5, wherein the first row of the search graph corresponds to non-wake-word content, and each of the second to Nth rows of the search graph corresponds to one word of the wake-up word.
7. The method of claim 1, wherein the decoding graph is an M*K matrix, K being the size of the vocabulary.
8. The method of claim 1, wherein performing the sliding window decoding operation on the decoding graph formed by the text sequence according to the length of the wake-up word, and determining the confidence of the wake-up word appearing in each sliding window, comprises:
for any sliding window operation, acquiring the posterior probability corresponding to each word in the sliding window, and multiplying the posterior probabilities corresponding to the words to obtain the confidence of the wake-up word appearing in each sliding window.
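The per-window confidence of claims 7-8 (the product of posteriors of the wake-word characters aligned with the window) can be sketched as below. Here `posteriors` stands for the M*K decoding graph and `wake_ids` for the wake word's vocabulary indices; both names are assumptions for illustration.

```python
import numpy as np

def window_confidences(posteriors, wake_ids):
    """For each window position, multiply the posterior of each
    wake-word token at its aligned row of the decoding graph."""
    m = posteriors.shape[0]           # number of decoded text positions
    L = len(wake_ids)                 # wake-word length = window size
    confs = []
    for start in range(m - L + 1):    # slide the window one step at a time
        conf = 1.0
        for j, tok in enumerate(wake_ids):
            conf *= posteriors[start + j, tok]
        confs.append(conf)
    return confs
```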
9. The method of claim 1, wherein extracting the text sequence from the speech signal to be processed comprises:
acquiring speech features to be processed from the speech signal to be processed, and encoding the speech features to obtain acoustic encoded representations;
and integrating each acoustic encoded representation with its corresponding weight to obtain the text sequence.
10. A voice wake-up apparatus, the apparatus comprising:
an acquisition unit, configured to acquire a voice signal to be processed and extract a text sequence from the voice signal to be processed;
a determining unit, configured to perform a sliding window decoding operation on a decoding graph formed by the text sequence according to the length of a wake-up word, and determine the confidence of the wake-up word appearing in each sliding window, wherein the wake-up word is used for waking up a device;
a wake-up unit, configured to wake up the device when the confidence of the wake-up word appearing in a sliding window meets a first preset condition; and when the confidence of the wake-up word in all N sliding windows does not meet the first preset condition, to further perform Viterbi decoding on a search graph formed by the text sequence to obtain a confidence on the Viterbi path, and wake up the device when the confidence on the Viterbi path meets a second preset condition, wherein N is the number of sliding operations;
and a termination unit, configured to terminate the sliding window decoding operation and the Viterbi decoding operation, wherein the Viterbi decoding operation is used for decoding the text sequence according to the wake-up word.
11. An electronic device, the device comprising: a processor and a memory;
the memory is configured to store instructions or a computer program;
and the processor is configured to execute the instructions or the computer program in the memory, to cause the electronic device to perform the method of any one of claims 1-9.
12. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1-9.
CN202110745052.1A 2021-06-30 2021-06-30 Voice awakening method, device and equipment Active CN113470646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110745052.1A CN113470646B (en) 2021-06-30 2021-06-30 Voice awakening method, device and equipment


Publications (2)

Publication Number Publication Date
CN113470646A CN113470646A (en) 2021-10-01
CN113470646B true CN113470646B (en) 2023-10-20

Family

ID=77877282

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110745052.1A Active CN113470646B (en) 2021-06-30 2021-06-30 Voice awakening method, device and equipment

Country Status (1)

Country Link
CN (1) CN113470646B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114783438B (en) * 2022-06-17 2022-09-27 深圳市友杰智新科技有限公司 Adaptive decoding method, apparatus, computer device and storage medium
CN114822539A (en) * 2022-06-24 2022-07-29 深圳市友杰智新科技有限公司 Method, device, equipment and storage medium for decoding double-window voice

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103886010A (en) * 2013-12-25 2014-06-25 安徽科大讯飞信息科技股份有限公司 Keyword speech retrieval system and keyword speech retrieval method
CN110033758A (en) * 2019-04-24 2019-07-19 武汉水象电子科技有限公司 A kind of voice wake-up implementation method based on small training set optimization decoding network
CN110838289A (en) * 2019-11-14 2020-02-25 腾讯科技(深圳)有限公司 Awakening word detection method, device, equipment and medium based on artificial intelligence
CN110890093A (en) * 2019-11-22 2020-03-17 腾讯科技(深圳)有限公司 Intelligent device awakening method and device based on artificial intelligence
CN110910885A (en) * 2019-12-12 2020-03-24 苏州思必驰信息科技有限公司 Voice awakening method and device based on decoding network
CN111739521A (en) * 2020-06-19 2020-10-02 腾讯科技(深圳)有限公司 Electronic equipment awakening method and device, electronic equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107134279B (en) * 2017-06-30 2020-06-19 百度在线网络技术(北京)有限公司 Voice awakening method, device, terminal and storage medium
KR20200132613A (en) * 2019-05-16 2020-11-25 삼성전자주식회사 Method and apparatus for speech recognition with wake on voice



Similar Documents

Publication Publication Date Title
CN113327610B (en) Voice awakening method, device and equipment
CN112634876B (en) Speech recognition method, device, storage medium and electronic equipment
CN113470646B (en) Voice awakening method, device and equipment
CN113436620B (en) Training method of voice recognition model, voice recognition method, device, medium and equipment
CN112509555B (en) Dialect voice recognition method, device, medium and electronic equipment
CN110225386B (en) Display control method and display device
CN112712801B (en) Voice wakeup method and device, electronic equipment and storage medium
CN111435592B (en) Voice recognition method and device and terminal equipment
CN108055617B (en) Microphone awakening method and device, terminal equipment and storage medium
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
CN113327599B (en) Voice recognition method, device, medium and electronic equipment
CN112634872A (en) Voice equipment awakening method and device
CN115309877A (en) Dialog generation method, dialog model training method and device
CN111883117A (en) Voice wake-up method and device
CN111326146A (en) Method and device for acquiring voice awakening template, electronic equipment and computer readable storage medium
CN113470698A (en) Speaker transfer point detection method, device, equipment and storage medium
CN111312223A (en) Training method and device of voice segmentation model and electronic equipment
US20240169988A1 (en) Method and device of generating acoustic features, speech model training, and speech recognition
CN111261143B (en) Voice wakeup method and device and computer readable storage medium
CN111128131B (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN111276127B (en) Voice awakening method and device, storage medium and electronic equipment
CN116884402A (en) Method and device for converting voice into text, electronic equipment and storage medium
CN115862604B (en) Voice awakening model training and voice awakening method and device and computer equipment
CN111312224A (en) Training method and device of voice segmentation model and electronic equipment
CN113488050B (en) Voice wakeup method and device, storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant