CN110364143B - Voice awakening method and device and intelligent electronic equipment - Google Patents

Voice awakening method and device and intelligent electronic equipment

Info

Publication number
CN110364143B
CN110364143B CN201910747867.6A
Authority
CN
China
Prior art keywords
voice
data
keyword
wake
audio feature
Prior art date
Legal status
Active
Application number
CN201910747867.6A
Other languages
Chinese (zh)
Other versions
CN110364143A (en
Inventor
苏丹
陈杰
王珺
俞栋
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201910747867.6A
Publication of CN110364143A
Application granted
Publication of CN110364143B
Active legal status
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/14: Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L 15/142: Hidden Markov Models [HMMs]
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 2015/223: Execution procedure of a spoken command
    • G10L 2015/225: Feedback of the input speech

Abstract

A voice wake-up method and device based on artificial intelligence, and an intelligent electronic device, are disclosed. The voice wake-up method comprises the following steps: acquiring an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when the voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a two-class network.

Description

Voice awakening method and device and intelligent electronic equipment
Technical Field
The present disclosure relates to the field of speech recognition, and more particularly, to a method and an apparatus for voice wake-up based on artificial intelligence, and an intelligent electronic device thereof.
Background
Voice wake-up means that a user interacts with an electronic device through voice and causes the device to transition from a dormant state to an active state. At present, low-cost electronic devices often adopt relatively simple wake-up detection networks, which leads to a relatively high false wake-up rate. On the other hand, achieving higher wake-up detection accuracy requires a complex wake-up detection network, which places higher demands on the computing power of the device and therefore cannot be used universally across different electronic devices.
Disclosure of Invention
The embodiment of the disclosure provides a voice awakening method and device based on artificial intelligence and intelligent electronic equipment thereof.
The embodiment of the present disclosure provides a voice wake-up method based on artificial intelligence, which comprises: acquiring an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when the voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a two-class network.
An embodiment of the present disclosure further provides a voice wake-up apparatus, which comprises: a voice data extraction module configured to acquire an audio feature set of voice data; a first processing module configured to detect a voice wake-up keyword based on the audio feature set; and a second processing module configured to perform a wake-up decision on the audio feature set using a two-class network when the voice wake-up keyword is detected.
Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps in the method described above.
An embodiment of the present disclosure also provides an intelligent electronic device, including: the voice acquisition unit is used for acquiring voice data; a processor; a memory having stored thereon computer instructions which, when executed by the processor, implement the above-described method.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings used in the description of the embodiments will be briefly introduced below. The drawings in the following description are merely exemplary embodiments of the disclosure.
Fig. 1 is a schematic diagram illustrating a voice wake-up scenario according to an embodiment of the present disclosure.
Fig. 2 is a flowchart illustrating a voice wake-up method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram illustrating a voice wake-up method according to an embodiment of the present disclosure.
Fig. 4 is yet another schematic diagram illustrating a voice wake-up method according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating a voice wake-up apparatus according to an embodiment of the present disclosure.
Fig. 6 is a block diagram illustrating an intelligent electronic device according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram illustrating a terminal dual model system for voice wake-up.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, example embodiments according to the present disclosure will be described in detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
In the present specification and the drawings, steps and elements having substantially the same or similar characteristics are denoted by the same or similar reference numerals, and repeated description of the steps and elements will be omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used only for distinguishing the description, and are not to be construed as indicating or implying relative importance or order.
For the purpose of describing the present disclosure, concepts related to the present disclosure are introduced below.
Artificial Intelligence (AI) is a theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is a development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction. Currently, automatic speech recognition technology has been widely used in various fields. Voice wake-up detection, a branch of automatic speech recognition, has also been widely applied to various intelligent electronic devices as a common way of waking them up.
Fig. 1 is a schematic diagram illustrating a scenario 100 of voice wake detection according to an embodiment of the present disclosure.
Referring to fig. 1, in the scenario 100, user A and user B both interact with a smart device 101 by speaking to it.
The smart device 101 may be any smart device, such as a smart electronic device (e.g., a smart speaker, a smart television, a smart gateway, etc.), a smart phone, a smart in-vehicle device, and so on. The smart device 101 may also be a voice assistant apparatus, voice assistant software, or the like that can be installed in the above-described devices. When recognizing that the user has uttered the correct voice wake-up keyword, the smart device 101 may perform various operations according to the content of the voice wake-up keyword. For example, when the user says the correct voice wake-up keyword (e.g., user A in fig. 1 says "jingle"), the smart device 101 may recognize that the user said the correct voice wake-up keyword and switch from the sleep state to the running state. When the user says something other than the voice wake-up keyword (e.g., user B in fig. 1 says "goodbye"), the smart device 101 remains in the sleep state.
It is generally desirable to implement the above-described scenario 100 using voice wake detection techniques. Wake-on-speech detection (also called keyword spotting (KWS)) technology refers to detecting whether a piece of speech data includes a specific speech segment. Typically, this particular piece of speech includes a voice wake-up keyword, such as "jingle" in fig. 1.
Various systems can implement voice wake-up technology, including the deep keyword system (Deep KWS system), the keyword/filler hidden Markov model system, the terminal dual-model system, and the cloud two-stage model system. Each has its own drawbacks in practical industrial applications.
For example, the deep keyword system has a single-model structure and employs a deep neural network to achieve balanced wake-up performance. Because the system uses only a single model, it can hardly reach a sufficient recognition rate in complex application scenarios such as far-field and noisy conditions.
Fig. 7 shows a schematic diagram of a terminal dual-model system 700 for voice wake-up. The terminal dual-model system shown in fig. 7 uses two complex neural networks that perform a large amount of computation to obtain a relatively accurate wake-up result. The terminal dual-model system includes a low-computation module 701 and a precise-computation module 702. The low-computation module 701 includes an MFCC feature calculation module, a feature caching module, a small deep neural network (small DNN) module, and a first hidden Markov scoring (first HMM score) module. The small deep neural network module preliminarily judges whether the input voice is related to the voice wake-up keyword and outputs a first association probability. The first hidden Markov scoring module determines a first confidence based on the first association probability. The precise-computation module 702 includes a large deep neural network (large DNN) module and a second hidden Markov scoring (second HMM score) module. After the low-computation module 701 detects that the user has spoken the voice wake-up keyword, the feature data in the feature caching module is input to the large deep neural network module in the precise-computation module 702. The large deep neural network module judges again whether the input voice is related to the voice wake-up keyword, and outputs a second association probability to the second hidden Markov scoring module to obtain a second confidence. Because the terminal dual-model system 700 uses two complex neural networks connected in series on the terminal, and the second-stage neural network involves a larger amount of computation than the first-stage network, more computing resources are needed and higher demands are placed on the intelligent electronic device.
The cloud two-stage model system also uses two neural networks for the wake-up decision, and places the complex second-stage neural network in the cloud in order to reduce the amount of computation on the terminal side. However, because verification requires the network and the cloud, the system suffers from the technical problem of response delay.
The present disclosure provides an improved artificial intelligence based voice wake-up method that can reduce the amount of computation, shorten the delay, and improve the accuracy of the response of the smart device by using a two-class network as a second-level neural network.
Fig. 2 is a flow chart illustrating a voice wake-up method 200 according to an embodiment of the present disclosure.
The voice wake-up method 200 according to the embodiment of the present disclosure may be applied to any intelligent device, and may also be executed in the cloud and then return the decision result to the device to be woken up. Next, the smart device 101 in fig. 1 will be described as an example.
First, in step S201, an audio feature set of voice data is acquired.
Specifically, the voice data may include sound captured in various forms and converted into sound data stored in the form of a digital file, for example, sound data periodically captured by a microphone of the smart device 101, or the like. The voice data may be cached in the memory of the smart device 101 for further analysis. The voice data may be encoded or stored in the .mp3, .wav, .voc, and .au formats, etc. The present disclosure does not impose any limitation on the format of the voice data.
Each element in the above-described audio feature set refers to audio feature data that can be extracted from speech data. In order to characterize and recognize speech data, it is generally necessary to analyze data such as sound frequency, volume, emotion, pitch, energy, etc. of the speech data. These data may be referred to as "audio feature data" of the voice data.
To facilitate analysis of the voice data, the audio feature data may further be obtained using various speech feature extraction models. Speech feature extraction models include, but are not limited to, FBANK (also called FilterBank), MFCC, and the like. The audio feature data extracted by the FBANK speech feature extraction model is also called FBANK speech feature data. The present disclosure is described using FBANK speech feature data as an example, but the present disclosure is not limited thereto. The FBANK speech feature extraction model extracts audio features in a manner similar to the way the human ear processes the sounds it hears. By performing operations such as Fourier transform, energy spectrum calculation, and Mel filtering on the framed voice data, the FBANK speech feature extraction model can obtain an array (also called an FBANK feature vector) that characterizes each frame of voice data. This array is the FBANK audio feature data.
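As an illustration only, the following Python sketch shows one way such FBANK-style features could be computed with numpy; the 10 ms frame length, 8 ms step, 13 filters, and FFT size are assumptions chosen to match the example values given elsewhere in this description, not parameters mandated by the method.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def fbank_features(signal, sample_rate=16000, frame_len=0.010, frame_step=0.008,
                   n_filters=13, n_fft=256):
    """Return one log Mel-filterbank (FBANK) vector per frame of the input signal."""
    frame_size = int(frame_len * sample_rate)          # 10 ms frames ...
    step = int(frame_step * sample_rate)               # ... with an 8 ms step (2 ms overlap)
    n_frames = 1 + max(0, (len(signal) - frame_size) // step)
    frames = np.stack([signal[i * step:i * step + frame_size] for i in range(n_frames)])
    frames = frames * np.hamming(frame_size)           # window each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft   # energy spectrum per frame
    # Triangular Mel filterbank spanning 0 .. Nyquist.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Log filterbank energies: one L-dimensional FBANK vector per frame.
    return np.log(np.maximum(power @ fbank.T, 1e-10))
```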
In step S202, a voice wake-up keyword is detected based on the audio feature set.
Specifically, whether the voice data includes the voice wake-up keyword may be detected by further analyzing each audio feature data in the audio feature set. The voice wake-up keyword may be any keyword preset by the user or a default keyword in the smart device 101, such as "jingle" in fig. 1. The voice characteristic data of the voice data including the voice wakeup keyword may be determined in advance. The audio feature data in the audio feature set is then compared to these predetermined speech feature data to determine whether the audio feature set matches the wake-on-speech keyword. For example, FBANK voice feature data of the phrase "jingle" may be predetermined and then compared with the audio feature set obtained in step S201, so as to determine whether a voice wake-up keyword is detected.
The step of detecting the voice wake-up keyword may further comprise determining whether the audio feature set matches the voice wake-up keyword using a keyword detection network. The keyword detection network may be structured as various models, such as a DNN, a CNN, or an LSTM. The keyword detection network may utilize an acoustic model that uses phoneme labels to determine whether the audio feature set matches the voice wake-up keyword. A phoneme is the minimum unit of speech divided according to the natural attributes of speech, determined by the articulatory actions within a syllable. For example, the Chinese syllable ā contains one phoneme, while ài contains two phonemes. Since the voice wake-up keyword may be divided into a plurality of phonemes, a plurality of phoneme labels may be used to represent the speech characteristics of the voice wake-up keyword. The keyword detection network may sequentially calculate the association probability of each audio feature data item in the audio feature set with respect to the phoneme labels of the voice wake-up keyword. These association probabilities are then aggregated to obtain a confidence that the voice data includes the voice wake-up keyword. A confidence above a predetermined threshold indicates that the voice wake-up keyword has been detected.
Of course, the keyword detection network may also be another neural network model capable of recognizing the voice wake-up keyword, such as a hidden Markov model (HMM) or a Gaussian mixture model (GMM).
In step S203, in the case that the voice wake-up keyword is detected, a wake-up decision is performed on the audio feature set by using a binary network.
Specifically, the above-mentioned two-class network (which may also be referred to as a binary classification model) refers to a neural network that classifies its input into one of two classes (i.e., outputs 0 or 1). When a voice wake-up keyword is detected, the two-class network is activated so that a further decision can be made on the audio feature set. The two-class network has far fewer model parameters than the keyword detection network model, which reduces the amount of computation in the system. The wake-up decision on the audio feature set by the two-class network may be executed in the cloud or at the terminal, which is not limited by the present disclosure.
More specifically, the above-described two-class network may include a plurality of layers: an input layer, at least one hidden layer, and an output layer. Each hidden layer includes a plurality of nodes. The nodes may be neurons (cells) or perceptrons, and each node may have multiple inputs. The output layer includes at most two nodes. Each node may have a different weight and bias for any of its inputs, and the values of the weights and biases are trained on sample data.
The two-class network may be a fully connected neural network. In a fully connected neural network, every node in one layer is connected to every node in an adjacent layer. For example, each node in the input layer is connected to each node of the hidden layer closest to the input layer; the nodes in adjacent hidden layers are also interconnected; and each node in the hidden layer closest to the output layer is connected to the nodes of the output layer. A fully connected neural network can analyze the input audio feature data from more perspectives, thereby yielding a more accurate decision result.
Specifically, the plurality of audio feature data in the audio feature set may be synthesized into representative audio feature data, and the fully-connected neural network is used to make a wake-up decision on the representative audio feature data. "representative audio feature data" means audio feature data that is capable of characterizing/representing the set of audio features. For example, the "representative audio feature data" may be audio feature data formed by splicing a predetermined number of audio feature data in the audio feature set in chronological order. The "representative audio feature data" may also be audio feature data extracted by performing other secondary processing on each element in the audio feature set. The present disclosure does not limit the specific form of "representative audio feature data" as long as the set of audio features can be characterized.
The representative audio feature data is input into the input layer of the fully connected neural network and passes through the at least one hidden layer; the output layer can then output a "0" indicating that the smart device is not to be woken up or a "1" indicating that the smart device is to be woken up. Alternatively, the output layer may output a real number greater than or equal to 0 and less than 1; when this value is larger than a preset threshold, the intelligent electronic device is woken up. In this way, the two-class network completes the wake-up decision on the audio feature set.
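As a purely illustrative sketch (the layer sizes, input dimension, and 0.5 threshold below are assumptions, not values specified by this disclosure), such a lightweight fully connected two-class network could look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class WakeupClassifier(nn.Module):
    """Lightweight fully connected two-class network for the wake-up decision."""
    def __init__(self, input_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),   # input layer -> hidden layer
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),           # hidden layer -> single output node
            nn.Sigmoid(),                       # real number in [0, 1)
        )

    def forward(self, representative_features):
        return self.net(representative_features)

# Representative features: e.g. 41 buffered frames of 13-dim FBANK data, concatenated (assumed sizes).
model = WakeupClassifier(input_dim=41 * 13)
score = model(torch.randn(1, 41 * 13))
wake_up = bool(score.item() > 0.5)              # assumed example threshold
```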
When the decision is to wake up, the smart device 101 may be woken up. For example, when the two-class network is located in the cloud, the cloud server may send a signal to the smart device 101 through a wired and/or wireless network to trigger the transition of the smart device 101 from the sleep state to the working state. When the two-class network is located at the smart device 101, the wake-up decision may directly cause the smart device 101 to transition from the sleep state to the active state. When the decision is not to wake up, the smart device 101 may remain in the sleep state or do nothing.
Therefore, the voice wake-up method 200 according to the embodiment of the present disclosure can effectively suppress most false wake-ups with a small number of model parameters by using a two-class network, thereby significantly reducing the amount of computation, shortening the delay, and improving the accuracy of the smart device's response. Compared with common voice wake-up techniques that use only a complex single-model neural network or several complex neural network models of the same architecture, the voice wake-up method 200 can reach the level required for industrial application in complex scenarios such as far-field and high-noise conditions, wake the device correctly with low delay, and improve the usability of the smart device as a whole.
Fig. 3 is a schematic diagram illustrating a voice wake-up method 200 according to an embodiment of the present disclosure.
As shown in fig. 3, acquiring the audio feature set 302 of the speech data 301 may include acquiring audio feature data of each frame of the speech data.
Specifically, referring to fig. 3, the voice data 301 may be divided into a plurality of frames at certain time intervals. Typically, the duration of voice data containing the full voice wake-up keyword is 2 to 5 seconds. The voice data 301 may be divided into frames of 10 milliseconds each. To approximate the way human ears process voice data, there may be an overlapping portion between the voice data of two adjacent frames. For example, the first frame of voice data may be the data from 0 ms to 10 ms of the voice data, and the second frame may be the data from 8 ms to 18 ms.
Each frame of voice data can then be processed to obtain the audio feature data of each frame (step ① in fig. 3). For example, the FBANK audio feature data of each frame may be obtained using the FBANK model described above. Each frame of audio feature data may be an array of L dimensions, where L is greater than or equal to 1. Optionally, L is equal to 13. The audio feature set 302 may include the audio feature data of a plurality of consecutive frames.
With continued reference to fig. 3, each acquired frame of audio feature data may be buffered according to a predetermined buffering rule (step ② in fig. 3). For example, each frame of audio feature data may be input into the buffer 303 in turn. The predetermined buffering rules include, but are not limited to: buffering the audio feature data of a predetermined number of consecutive frames according to a first-in first-out rule; or buffering the audio feature data of a predetermined number of consecutive frames after a predetermined phoneme label is detected. Optionally, the size of the buffer 303 may be just large enough to cover the data needed to recognize the voice wake-up keyword. For example, assuming that recognizing the voice wake-up keyword "jingle" requires approximately M frames of audio feature data, the size of the buffer 303 may be M x L bits.
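For illustration only, a minimal first-in-first-out cache of this kind might be sketched as follows; the values of M and L are assumptions taken from the example above.

```python
from collections import deque
import numpy as np

M, L = 100, 13                       # assumed: about M frames cover the wake-up keyword, L-dim FBANK
feature_buffer = deque(maxlen=M)     # FIFO rule: the oldest frame is discarded automatically

def push_frame(fbank_vector):
    """Cache one frame of audio feature data (step 2 in fig. 3)."""
    feature_buffer.append(np.asarray(fbank_vector, dtype=np.float32))

def buffered_features():
    """Return the cached frames as an (n_frames, L) array for the keyword detection network."""
    return np.stack(feature_buffer) if feature_buffer else np.empty((0, L), dtype=np.float32)
```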
The buffer 303 may sequentially input the audio feature data of the first frame to the N-th frame into the keyword detection network 304 (step ③ in fig. 3). To obtain more accurate results, the keyword detection network 304 may be a complex deep neural network. In particular, as shown in fig. 3, the keyword detection network 304 may include one or more hidden layers. Each hidden layer includes a plurality of neurons (cells), and each neuron may have a plurality of inputs. For example, the inputs of the neurons in the hidden layer closest to the input layer may be data of any dimension of the L-dimensional audio feature data. Each neuron has a weight and a bias for each input. The values of the weights and biases are trained on a large amount of sample data. The keyword detection network 304 in fig. 3 is merely an example; it may also have other structures. The present disclosure does not limit the structure of the keyword detection network 304, the number of nodes in each layer, or the connection manner between the nodes.
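For illustration, a simple acoustic model of this kind could be sketched as below; the hidden sizes, the number of phoneme labels, and the extra filler label are assumptions, not a structure fixed by this disclosure.

```python
import torch
import torch.nn as nn

class KeywordAcousticModel(nn.Module):
    """Assumed sketch: maps one L-dim FBANK frame to posteriors over the keyword phoneme labels
    (plus one assumed filler / non-keyword label), i.e. the association probabilities P_ij."""
    def __init__(self, feature_dim=13, n_phoneme_labels=30, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_phoneme_labels + 1),   # +1 for the filler label
            nn.Softmax(dim=-1),                            # posterior over labels for this frame
        )

    def forward(self, frame):
        return self.net(frame)

# One column of a posterior matrix P per frame: P[:, j] = model(frame_j)[:n_phoneme_labels].
```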
The keyword detection network 304 compares each frame of audio feature data in the buffer with the phoneme labels of the voice wake-up keyword to determine the association probability between the frame of audio feature data and each phoneme label. The keyword detection network 304 may process audio feature data of one frame at a time, or may process audio feature data of multiple frames at a time. The following description takes processing one frame at a time as an example. The keyword detection network 304 may calculate the association probability Pij between the j-th frame of audio feature data and the i-th phoneme label, where i and j are integers not less than 0. For example, when processing the first frame of audio feature data of the voice wake-up keyword "jingle", the keyword detection network 304 may compare the first phoneme label of the wake-up keyword with the first frame of audio feature data and output the probability P11 that the two are associated.
Since the association probability Pij usually contains noise, it can be smoothed with a smoothing window before the confidence of the voice wake-up keyword is calculated. For example, the association probability Pij can be processed using the following equation (1) to obtain a smoothed association probability Pij':
Pij' = (1 / (j - hsmooth + 1)) · Σ (k = hsmooth to j) Pik    (1)
In formula (1), k takes any value between hsmooth and j, where hsmooth denotes the index/frame number of the first frame of data in the smoothing window. hsmooth can be calculated by the following equation (2):
hsmooth = max{1, j - wsmooth + 1}    (2)
Here wsmooth is the size of the smoothing window. For example, when the size of the smoothing window is 6 frames, j is 10, and i is 9, the smoothed association probability Pij' is the average of the probabilities that the audio feature data of the 5th to 10th frames are each associated with the 9th phoneme label; in this case hsmooth equals 5. The smoothing reduces the noise between the association probabilities of consecutive frames, making the confidence more accurate.
The keyword detection network 304 may then input the smoothed association probabilities Pij' one by one into the confidence calculation window 305 (step ④ in fig. 3), without calculating all of the association probabilities Pij' at once. The keyword detection network 304 may calculate, at frame j, the confidence that the voice wake-up keyword has been detected. Assume that the confidence calculation window 305 has a window size wmax, with wmax greater than 1. Specifically, the confidence calculation window 305 may use the following equation (3):
confidence = [ Π (i = 1 to n) max (m = hmax to j) Pim' ]^(1/n)    (3)
to calculate a confidence that the voice wake-up keyword is detected in the audio feature set 302. In the above formula (3), n represents the index of the currently calculated phoneme label. For example, assuming that the voice wake-up keyword has 30 phoneme labels and the 25th phoneme label is currently being processed, then n equals 25. m takes any value between hmax and j. hmax represents the index/frame number of the first frame in the confidence calculation window, and can be obtained by the following equation (4):
hmax = max{1, j - wmax + 1}    (4)
According to the above equations (1) to (4), the confidence calculation window 305 generally outputs a small confidence in the first few frames, since at that point the data in the audio feature set has not yet been compared with most of the phoneme labels of the voice wake-up keyword. The confidence changes as more audio feature data is compared. If the voice data includes the voice wake-up keyword, the confidence output by the confidence calculation window 305 increases as more audio feature data is compared. When the confidence reaches a certain threshold, it is determined that the voice wake-up keyword is detected. For example, in equation (3), assuming a total of 30 phoneme labels, the confidence may already exceed the threshold when n is 25. At this point, the association probabilities between the 26th-30th phoneme labels and the audio feature data need not be calculated, and it can be directly determined that the voice wake-up keyword is detected. If the voice data does not include the voice wake-up keyword, the confidence output by the confidence calculation window 305 will never reach the threshold, and it is therefore determined that no voice wake-up keyword is detected.
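A minimal numpy sketch of the smoothing and confidence computation described by equations (1) to (4) is given below; the window sizes are example assumptions, and the sketch evaluates all phoneme labels rather than stopping early once the threshold is exceeded.

```python
import numpy as np

def smoothed_posteriors(P, w_smooth=6):
    """Equations (1)/(2): P_s[i, j] is the mean of P[i, h_smooth..j], h_smooth = max(1, j - w_smooth + 1)."""
    n_labels, n_frames = P.shape
    P_s = np.zeros_like(P)
    for j in range(1, n_frames + 1):                # 1-based frame index, as in the text
        h = max(1, j - w_smooth + 1)
        P_s[:, j - 1] = P[:, h - 1:j].mean(axis=1)
    return P_s

def confidence_at_frame(P_s, j, w_max=100):
    """Equations (3)/(4): geometric mean over the phoneme labels of the largest smoothed
    association probability inside the confidence calculation window ending at frame j."""
    n_labels = P_s.shape[0]
    h = max(1, j - w_max + 1)
    best = P_s[:, h - 1:j].max(axis=1)              # max over frames h_max .. j for each label
    return float(np.prod(best) ** (1.0 / n_labels))

# Example use: P holds the association probabilities P_ij output by the keyword detection network,
# with one row per phoneme label and one column per frame.
# wake_keyword_detected = confidence_at_frame(smoothed_posteriors(P), j=P.shape[1]) > threshold
```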
The keyword detection network 304 and confidence calculation window 305 described above may be calculated in parallel to reduce latency.
As described above, upon detecting the voice wake-up keyword, the two-class network 306 may be activated to make a wake-up decision on the audio feature set 302. Specifically, the confidence calculation window 305 may send a specific signal to the buffer 303 when the confidence is greater than the threshold (step ⑤ in fig. 3). The buffer 303 then sends its buffered audio feature data to the two-class network 306 (step ⑥ in fig. 3). Specifically, assume that at the j-th frame the confidence output by the confidence calculation window 305 is greater than the threshold. At this point, the audio feature data of the (j-p)-th frame to the (j+p)-th frame in the buffer (p being a natural number equal to or greater than 0) may be synthesized into representative audio feature data and input to the two-class network 306. Of course, all of the audio feature data in the buffer may also be input to the two-class network 306. The two-class network 306 may then determine whether to wake up the smart device 101 according to the method described above.
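For example, splicing the frames around the triggering frame into one representative vector might look like the following; p, the trigger frame index j, and the reuse of the buffer helper sketched earlier are all illustrative assumptions.

```python
# Assumed sketch: splice frames j-p .. j+p from the cache into one representative feature vector
# once the confidence exceeds the threshold at frame j.
p = 20                                           # assumed half-window, in frames
frames = buffered_features()                     # (n_frames, L) array from the cache sketched above
j = frames.shape[0] - 1                          # e.g. the trigger frame is the newest cached frame
segment = frames[max(0, j - p):j + p + 1]        # frames around the trigger frame
representative = segment.reshape(-1)             # chronological concatenation, fed to the two-class network
```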
The two-class network 306 in fig. 3 is shown as a fully connected network, which is only an example; those skilled in the art will appreciate that the two-class network 306 may also have other structures, for example including a plurality of hidden layers, and the present disclosure does not limit its structure.
Fig. 4 is yet another schematic diagram illustrating a voice wake-up method 200 according to an embodiment of the present disclosure.
Referring to fig. 4, the voice wake-up method 200 of the present disclosure may be implemented by two modules, namely a high wake-up rate module 401 and a low false wake-up rate module 402.
The high-wake-up-rate module 401 includes an FBANK feature calculation module, a feature cache module, a keyword detection network, and a posterior processing module. The FBANK feature calculation module is configured to calculate the FBANK features of the audio input, for example implementing step ① in fig. 3. The feature cache module is used to store the calculated FBANK features, for example implementing step ② in fig. 3. The keyword detection network is used to detect the voice wake-up keyword and may be similar to the keyword detection network 304 in fig. 3. The posterior processing module further processes the association probabilities output by the keyword detection network (also referred to as posterior probabilities, since they are computed given the conditions/inputs), and may be similar to the confidence calculation window 305 in fig. 3.
Specifically, the keyword detection network is used in the high-wake-up-rate module 401 to implement wake-up keyword detection, so a high wake-up rate can be achieved. For this purpose, the audio data samples used to train the keyword detection network may be clean and have a high signal-to-noise ratio. Assume that the keyword detection network is trained using a first set of voice data samples, and that averaging the signal-to-noise ratios of the individual voice data samples in the first set yields a first average signal-to-noise ratio. The first average signal-to-noise ratio may be relatively high. For example, the first set of voice data samples may comprise samples A in which the user clearly utters the wake-up keyword in a quiet environment, e.g., voice data samples in which the user says "jingle". To be clearly distinguishable from samples A, the first set of voice data samples may also include samples B in which the user clearly utters randomly drawn words other than the voice wake-up keyword in a quiet environment, e.g., voice data samples in which the user says "goodbye", "hello", or "the weather is really nice".
The keyword detection network trained using the first voice data sample set may have a high false wake-up rate when processing input data with a low signal-to-noise ratio. The "false wake-up rate" refers to the probability that voice data not including the voice wake-up keyword is recognized as including the voice wake-up keyword. For example, when the processed voice data contains a lot of music or television noise, the keyword detection network may misidentify voice data that does not include the voice wake-up keyword as including it; for instance, voice data containing "bye jingle" may be erroneously recognized as voice data containing the wake-up keyword. For this reason, the low-false-wake-up-rate module 402 can be used to make a wake-up decision on the voice data to reduce the false wake-up rate of the voice wake-up method 200.
The low false wake-up rate module 402 includes a two-class network and a threshold decision module. The two-class network is similar to the two-class network 306 in fig. 3. The threshold decision module is used for determining whether to wake up the intelligent electronic device based on the output of the two-classification network.
The low-false-wake-up-rate module 402 uses the two-class network to implement the wake-up decision on the voice data so as to achieve a low false wake-up rate. The two-class network is trained using a second set of voice data samples having a second average signal-to-noise ratio, where the second average signal-to-noise ratio is less than the first average signal-to-noise ratio. For example, the data samples in the second set of voice data samples may be voice data samples synthesized from the sample data in the first set and various noise data. The noise data may be strong noise data, or real music, television background sound data, and the like. The second set of voice data samples may also include samples A' of the user speaking the wake-up keyword in a noisy environment and, of course, samples B' of the user speaking random words other than the wake-up keyword in a noisy environment.
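One common way to synthesize such lower-SNR samples is to mix clean recordings with noise scaled to a target signal-to-noise ratio; the sketch below is an assumed illustration, not a procedure prescribed by this disclosure.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db (in dB), then mix."""
    noise = np.resize(noise, speech.shape)                 # repeat or trim the noise to the speech length
    p_speech = np.mean(speech.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Second sample set: wake-keyword and non-keyword recordings mixed with music or television
# background sound at a low SNR, e.g. mix_at_snr(clean_sample, tv_noise, snr_db=0).
```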
After the training of the keyword detection network is completed, the second voice data sample set, which has been labeled in advance as to whether each sample contains the voice wake-up keyword, may be input into the keyword detection network. According to the output of the keyword detection network, the voice data samples in the second set are classified into positive sample voice data and negative sample voice data. The positive sample voice data is voice data correctly recognized by the keyword detection network, and the negative sample voice data is voice data erroneously recognized by the keyword detection network. The two-class network is trained using the positive sample voice data and the negative sample voice data.
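The following sketch illustrates one possible reading of this training procedure; `second_sample_set`, `keyword_network_detects`, and `representative` are hypothetical helpers standing in for the frozen first-stage network and the feature splicing described above, the network class is the one sketched earlier, and the optimizer settings are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

# Split the second sample set by whether the frozen keyword detection network judged it correctly:
# positive samples = true detections of the wake-up keyword, negative samples = false wake-ups.
positives, negatives = [], []
for features, has_keyword in second_sample_set:            # hypothetical iterable of labelled samples
    if keyword_network_detects(features):                  # hypothetical wrapper around the first stage
        (positives if has_keyword else negatives).append(representative(features))

X = torch.tensor(np.stack(positives + negatives), dtype=torch.float32)
y = torch.tensor([1.0] * len(positives) + [0.0] * len(negatives)).unsqueeze(1)

model = WakeupClassifier(input_dim=X.shape[1])             # the two-class network sketched earlier
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()
for _ in range(20):                                        # a few illustrative epochs
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```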
The trained two-class network can refine the output of the keyword detection network, that is, judge whether the result of the keyword detection network is correct, so that most false wake-ups are effectively suppressed while a high wake-up rate is maintained. Moreover, because the two-class network is a lightweight neural network, it does not introduce excessive system overhead, and the wake-up performance is significantly improved without affecting system performance.
Fig. 5 is a schematic diagram illustrating a voice wake-up apparatus 500 according to an embodiment of the present disclosure.
The voice wake-up apparatus 500 according to an embodiment of the present disclosure includes a voice data extraction module 501, a first processing module 502, and a second processing module 503. The voice data extraction module 501 is configured to obtain an audio feature set of the voice data. The first processing module 502 is configured to detect a voice wake-up keyword based on the set of audio features. The second processing module 503 is configured to, when a voice wakeup keyword is detected, perform a wakeup decision on the audio feature set by using a binary network.
The voice wake-up apparatus 500 further comprises: a wake-up module 504 for waking up the intelligent electronic device if the voice data is determined to be awake.
The two-class network in the voice wake-up apparatus 500 comprises a fully connected neural network. Making a wake-up decision on the audio feature set using the two-class network when a voice wake-up keyword is detected comprises: activating the fully connected neural network when the voice wake-up keyword is detected, synthesizing the plurality of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
The voice wake-up apparatus 500 performs the detection of the voice wake-up keyword and the wake-up decision through the first processing module 502 and the second processing module 503, which are connected in series. Compared with common voice wake-up techniques, this can achieve a higher wake-up rate and significantly reduce false wake-ups.
Specifically, the first processing module 502 may use the keyword detection network to detect the voice wake-up keyword. The keyword detection network makes a wake-up decision on the input voice using the association probability between the acoustic model and the input voice data (also called the acoustic model posterior probability) together with a confidence calculation.
Optionally, when performing the acoustic model posterior probability calculation and the confidence calculation, the keyword detection network may cache the audio feature data with a fixed window size. When the calculated confidence reaches a particular threshold, it is determined that the voice wake-up keyword is detected. The first processing module 502 may then send the cached fixed-window-size audio feature data to the second processing module 503.
The second processing module 503 may use a binary network to make a wake-up decision after receiving the audio feature data sent by the first processing module 502.
As described above, the two-class network in the second processing module 503 may be trained using sample data to which a large amount of noise data, such as music and television sound, has been added. Because the two-class network is a lightweight network, it can significantly improve the false wake-up performance of the system while ensuring that no excessive additional overhead is introduced.
Fig. 6 is a block diagram illustrating an intelligent electronic device 600 according to an embodiment of the present disclosure.
Referring to fig. 6, the intelligent electronic device 600 may include a processor 601, a memory 602, and a voice acquisition unit 604. The processor 601, the memory 602, and the voice acquisition unit 604 may all be connected by a bus 603. The intelligent electronic device 600 may be a smart speaker, a smart television, a smart set-top box, or a smart phone, etc.
The processor 601 may perform various actions and processes according to programs stored in the memory 602. In particular, the processor 601 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor, and may be of the X86 or ARM architecture.
The memory 602 stores computer instructions that, when executed by the processor 601, implement the voice wake-up method 200 described above. The memory 602 may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchronous Link Dynamic Random Access Memory (SLDRAM), and direct memory bus random access memory (DR RAM). It should be noted that the memories of the methods described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
The voice collecting unit 604 may be an energy converting unit, such as a microphone, capable of converting a sound signal into an electrical signal. The speech acquisition unit 604 may perform the acousto-electric conversion in various forms: electric (moving coil type, aluminum tape type), capacitive (direct current polarization type), piezoelectric (crystal type, ceramic type), electromagnetic, carbon particle type, semiconductor type, and the like. The electrical signals collected by the voice collecting unit may be stored in the memory 602 in the form of a digital file.
The present disclosure also provides a computer readable storage medium having stored thereon computer instructions that, when executed by a processor, implement the voice wake-up method 200. Similarly, computer-readable storage media in embodiments of the disclosure may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. It should be noted that the computer-readable storage media described herein are intended to comprise, without being limited to, these and any other suitable types of memory.
The voice wake-up method, the voice wake-up apparatus, the computer-readable storage medium, and the intelligent electronic device of the present disclosure can solve technical problems in existing voice wake-up technology, such as a large amount of computation, long delay, and slow response, and improve the usability of voice wake-up technology.
It is to be noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of this disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic or any combination thereof. Certain aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of embodiments of the disclosure have been illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention, as set forth in detail above, are intended to be illustrative, not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof may be made without departing from the principles and spirit of the invention, and that such modifications are intended to be within the scope of the invention.

Claims (14)

1. An artificial intelligence based voice wake-up method, comprising:
acquiring an audio feature set of voice data;
detecting a voice awakening keyword by using a keyword detection network based on the audio feature set; and
in the case of detecting the voice wake-up keyword, making a wake-up decision on the audio feature set by using a two-class network,
wherein the keyword detection network is trained using a first set of speech data samples having a first average signal-to-noise ratio,
the two-class network is trained using a second set of speech data samples having a second average signal-to-noise ratio, and the first average signal-to-noise ratio is higher than the second average signal-to-noise ratio.
2. The artificial intelligence based voice wakeup method of claim 1, wherein the two-class network is a fully connected neural network,
wherein, when a voice wake-up keyword is detected, using a two-class network to make a wake-up decision on the audio feature set comprises:
activating the fully-connected neural network in the event that a voice wake-up keyword is detected,
and synthesizing the plurality of audio feature data in the audio feature set into representative audio feature data, and performing awakening judgment on the representative audio feature data by using the fully-connected neural network.
3. The artificial intelligence based voice wakeup method according to claim 1, wherein obtaining the set of audio features of the voice data comprises:
acquiring audio characteristic data of each frame of voice data; and
buffering the acquired audio characteristic data of each frame according to a preset buffering rule,
wherein the set of audio features comprises audio feature data of a plurality of consecutive frames.
4. The artificial intelligence based voice wakeup method of claim 3, wherein the voice wakeup keyword includes a plurality of phoneme tags, and detecting the voice wakeup keyword based on the set of audio features includes:
comparing each frame of audio characteristic data in the cache with the phoneme label of the voice awakening keyword by utilizing the keyword detection network to determine the association probability of the frame of audio characteristic data and the phoneme label;
and determining the confidence level of the voice awakening keyword detected in the audio feature set according to the association probability.
5. An artificial intelligence based voice wakeup method according to claim 3, wherein the predetermined caching rules include at least one of:
according to a first-in first-out rule, caching audio characteristic data of a preset number of continuous frames;
after detecting the predetermined phoneme label, buffering the audio feature data of a predetermined number of consecutive frames.
6. The artificial intelligence based voice wakeup method of claim 4,
wherein the two-class network is trained after the training of the keyword detection network is completed.
7. The artificial intelligence based voice wakeup method according to claim 6,
wherein at least a portion of the first set of voice data samples is voice data that includes the voice wake-up keyword.
8. The artificial intelligence based voice wakeup method of claim 1, wherein:
waking up an intelligent electronic device in a case that the voice data is decided to be woken up.
9. A voice wake-up device, comprising:
the voice data extraction module is used for acquiring an audio feature set of the voice data;
the first processing module is used for detecting voice awakening keywords by using a keyword detection network based on the audio feature set; and
a second processing module, configured to perform a wake-up decision on the audio feature set by using a binary network when a voice wake-up keyword is detected,
wherein the keyword detection network is trained using a first set of speech data samples having a first average signal-to-noise ratio,
the two-class network is trained using a second set of speech data samples having a second average signal-to-noise ratio, and the first average signal-to-noise ratio is higher than the second average signal-to-noise ratio.
10. The voice wake-up device of claim 9, further comprising:
and the awakening module is used for awakening the intelligent electronic equipment under the condition that the voice data is judged to be awakened.
11. The voice wake-up device of claim 9, wherein,
the two-class network is a fully-connected neural network,
wherein, when a voice wake-up keyword is detected, using a two-class network to make a wake-up decision on the audio feature set comprises:
activating the fully-connected neural network in the event that a voice wake-up keyword is detected,
and synthesizing the plurality of audio feature data in the audio feature set into representative audio feature data, and performing awakening judgment on the representative audio feature data by using the fully-connected neural network.
12. An intelligent electronic device comprising:
the voice acquisition unit is used for acquiring voice data;
a processor; and
a memory having stored thereon computer instructions which, when executed by the processor, implement the method of any one of claims 1-8.
13. The intelligent electronic device of claim 12, wherein the intelligent electronic device is a smart speaker, a smart television, a smart set-top box, or a smart phone.
14. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any one of claims 1-8.
CN201910747867.6A 2019-08-14 2019-08-14 Voice awakening method and device and intelligent electronic equipment Active CN110364143B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910747867.6A CN110364143B (en) 2019-08-14 2019-08-14 Voice awakening method and device and intelligent electronic equipment

Publications (2)

Publication Number Publication Date
CN110364143A (en) 2019-10-22
CN110364143B (en) 2022-01-28

Family

ID=68224739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910747867.6A Active CN110364143B (en) 2019-08-14 2019-08-14 Voice awakening method and device and intelligent electronic equipment

Country Status (1)

Country Link
CN (1) CN110364143B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110706707B (en) 2019-11-13 2020-09-18 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer-readable storage medium for voice interaction
CN112885339A (en) * 2019-11-14 2021-06-01 杭州智芯科微电子科技有限公司 Voice awakening system and voice recognition system
CN110838289B (en) * 2019-11-14 2023-08-11 腾讯科技(深圳)有限公司 Wake-up word detection method, device, equipment and medium based on artificial intelligence
CN110808030B (en) * 2019-11-22 2021-01-22 珠海格力电器股份有限公司 Voice awakening method, system, storage medium and electronic equipment
CN113192499A (en) * 2020-01-10 2021-07-30 青岛海信移动通信技术股份有限公司 Voice awakening method and terminal
CN111192590B (en) * 2020-01-21 2022-09-23 思必驰科技股份有限公司 Voice wake-up method, device, equipment and storage medium
CN111739521B (en) * 2020-06-19 2021-06-22 腾讯科技(深圳)有限公司 Electronic equipment awakening method and device, electronic equipment and storage medium
CN111653276B (en) * 2020-06-22 2022-04-12 四川长虹电器股份有限公司 Voice awakening system and method
CN111816193B (en) * 2020-08-12 2020-12-15 深圳市友杰智新科技有限公司 Voice awakening method and device based on multi-segment network and storage medium
CN112233656A (en) * 2020-10-09 2021-01-15 安徽讯呼信息科技有限公司 Artificial intelligent voice awakening method
CN111933114B (en) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 Training method and use method of voice awakening hybrid model and related equipment
CN112509568A (en) * 2020-11-26 2021-03-16 北京华捷艾米科技有限公司 Voice awakening method and device
TWI767532B (en) * 2021-01-22 2022-06-11 賽微科技股份有限公司 A wake word recognition training system and training method thereof
CN115148197A (en) * 2021-03-31 2022-10-04 华为技术有限公司 Voice wake-up method, device, storage medium and system
CN115312049A (en) * 2022-06-30 2022-11-08 青岛海尔科技有限公司 Command response method, storage medium and electronic device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI489372B (en) * 2013-04-10 2015-06-21 Via Tech Inc Voice control method and mobile terminal apparatus
KR101794884B1 (en) * 2016-03-15 2017-12-01 재단법인대구경북과학기술원 Apparatus for drowsy driving prevention using voice recognition load and Method thereof
CN107767861A (en) * 2016-08-22 2018-03-06 科大讯飞股份有限公司 voice awakening method, system and intelligent terminal
CN108305617A (en) * 2018-01-31 2018-07-20 腾讯科技(深圳)有限公司 The recognition methods of voice keyword and device
CN109599124A (en) * 2018-11-23 2019-04-09 腾讯科技(深圳)有限公司 A kind of audio data processing method, device and storage medium
CN109461448A (en) * 2018-12-11 2019-03-12 百度在线网络技术(北京)有限公司 Voice interactive method and device
CN110033758A (en) * 2019-04-24 2019-07-19 武汉水象电子科技有限公司 A kind of voice wake-up implementation method based on small training set optimization decoding network

Also Published As

Publication number Publication date
CN110364143A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110364143B (en) Voice awakening method and device and intelligent electronic equipment
US10403266B2 (en) Detecting keywords in audio using a spiking neural network
US8930196B2 (en) System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands
CN106940998B (en) Execution method and device for setting operation
TW201935464A (en) Method and device for voiceprint recognition based on memorability bottleneck features
CN105679310A (en) Method and system for speech recognition
CN110428854B (en) Voice endpoint detection method and device for vehicle-mounted terminal and computer equipment
CN112102850A (en) Processing method, device and medium for emotion recognition and electronic equipment
CN111210829A (en) Speech recognition method, apparatus, system, device and computer readable storage medium
CN113192535B (en) Voice keyword retrieval method, system and electronic device
US11763801B2 (en) Method and system for outputting target audio, readable storage medium, and electronic device
US20190348032A1 (en) Methods and apparatus for asr with embedded noise reduction
CN112992191B (en) Voice endpoint detection method and device, electronic equipment and readable storage medium
CN110428853A (en) Voice activity detection method, Voice activity detection device and electronic equipment
CN111145763A (en) GRU-based voice recognition method and system in audio
CN113628612A (en) Voice recognition method and device, electronic equipment and computer readable storage medium
CN111179910A (en) Speed of speech recognition method and apparatus, server, computer readable storage medium
CN109065026B (en) Recording control method and device
US11769491B1 (en) Performing utterance detection using convolution
Mistry et al. Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann)
US11763806B1 (en) Speaker recognition adaptation
CN112802498A (en) Voice detection method and device, computer equipment and storage medium
WO2020073839A1 (en) Voice wake-up method, apparatus and system, and electronic device
CN115132197B (en) Data processing method, device, electronic equipment, program product and medium
CN116129942A (en) Voice interaction device and voice interaction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant