CN110364143A - Voice wake-up method and apparatus, and intelligent electronic device - Google Patents
Voice wake-up method and apparatus, and intelligent electronic device
- Publication number
- CN110364143A CN110364143A CN201910747867.6A CN201910747867A CN110364143A CN 110364143 A CN110364143 A CN 110364143A CN 201910747867 A CN201910747867 A CN 201910747867A CN 110364143 A CN110364143 A CN 110364143A
- Authority
- CN
- China
- Prior art keywords
- voice
- keyword
- data
- wake
- frequency characteristics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (all within Section G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding; G10L15/00 — Speech recognition)
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/142 — Hidden Markov Models [HMMs] (speech classification or search using statistical models)
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223 — Execution procedure of a spoken command
- G10L2015/225 — Feedback of the input speech
Abstract
Disclosed are an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device. The voice wake-up method includes: obtaining an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when a voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a binary classification network.
Description
Technical field
The present disclosure relates to the field of speech recognition, and more particularly to an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device.
Background art
Voice wake-up refers to a user interacting with an electronic device by voice so that the device transitions from a dormant state to an activated state. At present, low-cost electronic devices often use relatively simple wake-up detection networks, whose false wake-up rates are comparatively high. On the other hand, achieving higher wake-up detection accuracy requires a complex wake-up detection network, which places higher demands on the computing capability of the electronic device and therefore cannot be widely deployed across all kinds of electronic devices.
Summary of the invention
Embodiments of the present disclosure provide an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device.
An embodiment of the present disclosure provides an artificial-intelligence-based voice wake-up method, comprising: obtaining an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when a voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a binary classification network.
An embodiment of the present disclosure further provides a voice wake-up apparatus, comprising: a voice data extraction module configured to obtain an audio feature set of voice data; a first processing module configured to detect a voice wake-up keyword based on the audio feature set; and a second processing module configured to, when a voice wake-up keyword is detected, perform a wake-up decision on the audio feature set using a binary classification network.
An embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the above method.
An embodiment of the present disclosure further provides an intelligent electronic device, comprising: a voice acquisition unit configured to acquire voice data; a processor; and a memory storing computer instructions which, when executed by the processor, implement the above method.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. The drawings described below show only exemplary embodiments of the present disclosure.
Fig. 1 is a schematic diagram showing a voice wake-up scene according to an embodiment of the present disclosure.
Fig. 2 is a flowchart showing a voice wake-up method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram showing the voice wake-up method according to an embodiment of the present disclosure.
Fig. 4 is another schematic diagram showing the voice wake-up method according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram showing a voice wake-up apparatus according to an embodiment of the present disclosure.
Fig. 6 is a structural diagram showing an intelligent electronic device according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram showing a terminal dual-model system for voice wake-up.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present disclosure more apparent, example embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure, and it should be understood that the disclosure is not limited by the example embodiments described herein.
In this specification and the drawings, substantially identical or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted. Meanwhile, in the description of the present disclosure, terms such as "first" and "second" are used only to distinguish the descriptions and are not to be understood as indicating or implying relative importance or order.
For ease of description, concepts related to the present disclosure are introduced below.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is an interdisciplinary field involving a wide range of areas, covering both hardware-level and software-level techniques. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and electromechanical integration. AI software technologies mainly include several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key speech technologies include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes. At present, automatic speech recognition is widely applied in many fields. Voice wake-up detection, as a branch of automatic speech recognition, is likewise widely used in various intelligent electronic devices and serves as a common way of waking up these devices.
Fig. 1 is a schematic diagram showing a scene 100 of voice wake-up detection according to an embodiment of the present disclosure.
Referring to Fig. 1, in the scene 100, user A and user B interact with a smart device 101 through spoken dialogue with it.
The smart device 101 may be any smart device, for example, an intelligent electronic device (e.g., a smart speaker, a smart TV, a smart gateway, etc.), a smartphone, a smart in-vehicle device, and so on. The smart device 101 may also be a voice assistant apparatus or voice assistant software carried on such a device. When the smart device 101 recognizes that a user has said the correct voice wake-up keyword, it can perform various operations according to the content of the keyword. For example, when a user says the correct wake-up keyword (e.g., user A in Fig. 1 says "Doraemon"), the smart device 101 recognizes that the user has said the correct voice wake-up keyword and is activated from the dormant state into the operating state. When a user says a wrong wake-up keyword (e.g., user B in Fig. 1 says "goodbye"), the smart device 101 remains in the dormant state.
Realizing the above scene 100 usually requires voice wake-up detection. Voice wake-up detection (also called keyword spotting, KWS) refers to detecting whether a segment of voice data contains a specific speech fragment. In general, this specific speech fragment contains the voice wake-up keyword, such as "Doraemon" in Fig. 1.
Various systems can implement voice wake-up, including the Deep KWS system, the keyword/filler hidden Markov model system, the terminal dual-model system, and the cloud two-stage model system. Each has its own shortcomings in practical industrial applications.
For example, the Deep KWS system is a single-model structure that uses a deep neural network to obtain balanced wake-up performance. Because the system uses only a single model structure, its recognition performance is insufficient in complex application scenarios such as far-field or noisy environments.
Fig. 7 shows a schematic diagram of a terminal dual-model system 700 for voice wake-up. The terminal dual-model system shown in Fig. 7 uses two complex neural networks and a large amount of computation to obtain a relatively accurate wake-up result. It includes a low-computation module 701 and an accurate-computation module 702. The low-computation module 701 includes an MFCC feature computation module, a feature cache module, a small deep neural network (small DNN) module, and a first hidden Markov scoring (first HMM score) module. The small DNN module preliminarily judges whether the input voice is related to the voice wake-up keyword and outputs a first association probability; the first HMM scoring module determines a first confidence level from the first association probability. The accurate-computation module 702 includes a large deep neural network (large DNN) module and a second hidden Markov scoring (second HMM score) module. After the low-computation module 701 detects that the user has said the wake-up keyword, the feature data in the feature cache module are input to the large DNN module in the accurate-computation module 702. The large DNN module judges again whether the input voice is related to the wake-up keyword and outputs a second association probability to the second HMM scoring module to obtain a second confidence level. Because the terminal dual-model system 700 uses two cascaded complex neural networks on the terminal, and the second-stage network requires even more computation than the first-stage network, it needs more computing resources and places higher demands on the intelligent electronic device.
The cloud two-stage model system likewise uses the two neural networks described above for the wake-up decision but, to reduce the computational load on the terminal side, places the complex second-stage neural network in the cloud. However, because this system must verify with the cloud over a network, it suffers from the technical problem of long response latency.
The present disclosure proposes an improved artificial-intelligence-based voice wake-up method that uses a binary classification network as the second-stage neural network, thereby reducing the amount of computation, shortening the latency, and improving the accuracy of the smart device's response.
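The two-stage gating just described — a cheap first-stage keyword detector whose positive result activates a small second-stage classifier — can be sketched as follows. This is a minimal illustration of the control flow only; the two toy stand-in models are invented for the example and are not the networks of this disclosure.

```python
# Minimal sketch of the two-stage wake-up pipeline: stage 1 (keyword
# detection network) gates stage 2 (binary classification network).

def wake_up_decision(feature_set, detect_keyword, binary_classifier,
                     threshold=0.5):
    """Return True only if stage 1 fires AND stage 2 confirms."""
    if not detect_keyword(feature_set):       # cheap first-stage check
        return False                          # stage 2 never runs
    score = binary_classifier(feature_set)    # small second-stage net
    return score > threshold

# Toy stand-ins for the two networks (illustrative only):
detector = lambda feats: sum(feats) > 3.0         # pretend KWS trigger
classifier = lambda feats: min(1.0, max(feats))   # pretend confidence

print(wake_up_decision([0.1, 0.2], detector, classifier))       # False
print(wake_up_decision([2.0, 2.0, 0.9], detector, classifier))  # True
```

Because the second stage only runs after a first-stage detection, the average computation per second of audio stays close to that of the cheap detector alone.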
Fig. 2 is a flowchart showing a voice wake-up method 200 according to an embodiment of the present disclosure.
The voice wake-up method 200 according to an embodiment of the present disclosure may be applied in any smart device, or may be executed in the cloud with the decision result returned to the device to be woken. In the following, the smart device 101 in Fig. 1 is taken as an example for explanation.
First, in step S201, an audio feature set of voice data is obtained.
Specifically, the voice data may include sound captured in various manners and stored in the form of a digital file, for example, voice data periodically captured by the microphone of the smart device 101. The voice data may be buffered in the memory of the smart device 101 for subsequent analysis, and may be encoded or stored in formats such as .mp3, .wav, .voc, and .au. The present disclosure places no restriction on the format of the voice data.
Each element in the audio feature set is audio feature data that can be extracted from the voice data. To characterize and recognize the voice data, it is usually necessary to analyze data such as its frequency, volume, mood, pitch, and energy. These data may be referred to as the "audio feature data" of the voice data.
To facilitate the analysis of the voice data, the audio feature data may be obtained using various speech feature extraction models, including but not limited to FBANK (also known as FilterBank) and MFCC. Audio feature data extracted by the FBANK speech feature extraction model are called FBANK speech feature data. The present disclosure is explained using FBANK speech feature data as an example, but is not limited thereto. The FBANK model extracts audio features in a manner similar to how the human ear processes the sound it hears: by applying operations such as the Fourier transform, energy spectrum computation, and mel filtering to the framed voice data, it obtains an array (also called an FBank feature vector) that can characterize each frame of voice data. This array is the FBANK audio feature data.
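The per-frame FBANK computation named above (Fourier transform, energy spectrum, mel filtering) can be sketched as follows. The triangular mel-filter construction is a common textbook form, and the sample rate, FFT size, and filter count are illustrative assumptions, not values taken from this disclosure.

```python
# Hedged sketch of FBANK feature extraction for one audio frame.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank_frame(frame, sample_rate=16000, n_fft=512, n_mels=13):
    """Compute log mel-filterbank energies for a single frame."""
    windowed = frame * np.hamming(len(frame))      # taper frame edges
    spectrum = np.fft.rfft(windowed, n_fft)        # Fourier transform
    power = (np.abs(spectrum) ** 2) / n_fft        # energy spectrum
    # Triangular mel filters spaced evenly on the mel scale:
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            filters[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            filters[m - 1, k] = (r - k) / max(r - c, 1)
    return np.log(filters @ power + 1e-10)         # mel filtering + log

# One 10 ms frame at 16 kHz is 160 samples:
frame = np.sin(2 * np.pi * 440 * np.arange(160) / 16000)
feats = fbank_frame(frame)
print(feats.shape)   # (13,) — an L-dimensional array per frame, here L = 13
```

The resulting 13-dimensional vector corresponds to one element of the audio feature set; production systems typically use library implementations rather than hand-rolled filterbanks.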
In step S202, a voice wake-up keyword is detected based on the audio feature set.
Specifically, by further analyzing each audio feature datum in the audio feature set, it can be detected whether the voice data contains the voice wake-up keyword. The wake-up keyword may be any keyword preset by the user, or a default keyword of the smart device 101, such as "Doraemon" in Fig. 1. The speech feature data of voice data containing the wake-up keyword can be determined in advance; the audio feature data in the audio feature set are then compared with these predetermined speech feature data to determine whether the audio feature set matches the wake-up keyword. For example, the FBANK speech feature data of the phrase "Doraemon" can be predetermined and then compared with the audio feature set obtained in step S201 to determine whether the wake-up keyword is detected.
The above step of detecting the wake-up keyword further includes using a keyword detection network to determine whether the audio feature set matches the wake-up keyword. The keyword detection network may be any of various model structures, such as a DNN, CNN, or LSTM. It may employ an acoustic model that uses phoneme labels to determine whether the audio feature set matches the wake-up keyword. A phoneme is the smallest speech unit divided according to the natural attributes of speech, determined by the articulatory action within a syllable. For example, the Chinese syllable ā contains one phoneme, while ài ("love") contains two. Since the wake-up keyword can be divided into multiple phonemes, multiple phoneme labels can be used to represent the speech features of the wake-up keyword. The keyword detection network computes, in sequence, the association probability of each audio feature datum in the audio feature set with respect to the phoneme labels of the wake-up keyword. These association probabilities are aggregated to obtain a confidence level that the voice data contains the wake-up keyword; a confidence level above a predetermined threshold indicates that the keyword is detected.
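The aggregation step above — combining per-frame phoneme association probabilities into one keyword confidence — can be sketched as follows. The smoothing-then-max-pooling recipe here follows the common Deep KWS approach and is an assumption, not the exact formula of this disclosure.

```python
# Sketch: per-frame phoneme posteriors -> smoothed scores -> one confidence.
import numpy as np

def keyword_confidence(posteriors, smooth_win=3):
    """posteriors: (n_frames, n_phonemes) association probabilities,
    one column per phoneme label of the wake-up keyword."""
    n, p = posteriors.shape
    smoothed = np.empty_like(posteriors)
    for t in range(n):                         # moving-average smoothing
        lo = max(0, t - smooth_win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    # Confidence: geometric mean of the best smoothed score per phoneme.
    best_per_phoneme = smoothed.max(axis=0)
    return float(best_per_phoneme.prod() ** (1.0 / p))

# Toy example: 3 phoneme labels firing one after another over 6 frames.
post = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.1],
                 [0.1, 0.9, 0.1],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.1, 0.9],
                 [0.0, 0.0, 0.8]])
conf = keyword_confidence(post)
print(conf > 0.5)   # True — treat as "keyword detected" above a threshold
```

Each phoneme label peaking in turn drives the confidence up; a keyword whose phonemes never all fire yields a low geometric mean and is rejected by the threshold.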
Of course, the keyword detection network may also be any other model capable of recognizing the voice wake-up keyword, for example, a hidden Markov model (HMM) or a Gaussian mixture model (GMM).
In step S203, when the voice wake-up keyword is detected, a wake-up decision is performed on the audio feature set using a binary classification network.
Specifically, the binary classification network (which may also be called a binary classification model) is a neural network that divides its input into two classes (that is, its output is 0 or 1). When the wake-up keyword is detected, the binary classification network is activated to make a further decision on the audio feature set. The model parameter count of the binary classification network is much smaller than that of the keyword detection network, so the amount of computation in the system can be reduced. The wake-up decision of the binary classification network on the audio feature set may be executed in the cloud or on the terminal; the present disclosure places no limitation on this.
More specifically, the binary classification network may include multiple layers: an input layer, at least one hidden layer, and an output layer. Each hidden layer contains multiple nodes. A node may be a neuron (cell) or a perceptron, and each node may have multiple inputs. The output layer contains no more than two nodes. Each node may have a different weight and bias for each of its inputs, and the values of the weights and biases are trained from sample data.
The binary classification network may be a fully connected neural network, meaning that every node in each pair of adjacent layers of the network is connected: each node in the input layer is connected to each node in the hidden layer nearest the input layer, the nodes in adjacent hidden layers are all interconnected, and each node in the hidden layer nearest the output layer is connected to the two nodes of the output layer. Using a fully connected neural network makes it possible to analyze the input audio feature data from more angles, yielding a more accurate decision result.
Specifically, multiple audio feature data in the audio feature set may be synthesized into representative audio feature data, and the wake-up decision is performed on the representative audio feature data using the fully connected neural network. "Representative audio feature data" denotes audio feature data that can characterize/represent the audio feature set. For example, the representative audio feature data may be formed by selecting a predetermined number of audio feature data from the audio feature set and splicing them together in chronological order, or may be audio feature data extracted after applying other secondary processing to each element in the audio feature set. The present disclosure does not limit the specific form of the representative audio feature data, as long as it can characterize the audio feature set.
The representative audio feature data is input to the input layer of the fully connected neural network and, via the at least one hidden layer, the output layer can output "0", indicating that the smart device should not be woken, or "1", indicating that it should be woken. The output layer may also output a real number greater than or equal to 0 and less than 1; a value greater than a predetermined threshold then indicates that the intelligent electronic device should be woken. The binary classification network thereby completes the wake-up decision performed on the audio feature set.
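A minimal numpy sketch of such a fully connected binary classifier follows: stacked frame features in, a single sigmoid score out, thresholded into a wake / no-wake decision. The layer sizes and random weights are illustrative assumptions; a deployed network would be trained on labeled wake-up samples.

```python
# Sketch: fully connected binary classification network for wake-up decision.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryWakeNet:
    """Input -> hidden (ReLU) -> 1 output unit (sigmoid)."""
    def __init__(self, n_in, n_hidden=16):
        self.w1, self.b1 = init_layer(n_in, n_hidden)
        self.w2, self.b2 = init_layer(n_hidden, 1)

    def score(self, x):
        h = np.maximum(0.0, x @ self.w1 + self.b1)    # fully connected + ReLU
        return float(sigmoid(h @ self.w2 + self.b2))  # real number in (0, 1)

    def decide(self, x, threshold=0.5):
        return self.score(x) > threshold              # wake / stay dormant

# "Representative" input: M = 10 frames of L = 13 features, concatenated.
features = rng.normal(size=10 * 13)
net = BinaryWakeNet(n_in=10 * 13)
s = net.score(features)
print(0.0 < s < 1.0)   # True — the sigmoid output always lies in (0, 1)
```

With a 130-input, 16-hidden-unit network, a forward pass costs roughly two small matrix-vector products, which is the parameter-count advantage over the keyword detection network that the text describes.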
When the voice data is judged as a wake-up, the smart device 101 can be woken. For example, when the binary classification network is located in the cloud, the cloud server can send a signal to the smart device 101 over a wired and/or wireless network to trigger the smart device 101 to transition from the dormant state to the working state. When the binary classification network is located on the smart device 101, the wake-up decision can directly activate the device to transition from the dormant state to the working state. When the voice data is judged as not a wake-up, the smart device 101 can remain dormant or do nothing.
Thus, the voice wake-up method 200 according to the embodiment of the present disclosure uses a binary classification network with a relatively small parameter count to effectively suppress most false wake-ups, significantly reducing the amount of computation while shortening latency and improving the accuracy of the smart device's response. Compared with common voice wake-up techniques that use only a single complex neural network model, or multiple complex neural network models of the same architecture, the voice wake-up method 200 can reach the level of industrial application in complex scenarios such as far-field and high-noise environments, waking the device correctly at low latency and improving the overall usability of the smart device.
Fig. 3 is a schematic diagram showing the voice wake-up method 200 according to an embodiment of the present disclosure.
As shown in Fig. 3, obtaining the audio feature set 302 of the voice data 301 may include obtaining the audio feature data of each frame of the voice data.
Specifically, referring to Fig. 3, the voice data 301 can be divided into multiple frames of a certain time length. In general, voice data containing a complete wake-up keyword is 2 to 5 seconds long. The voice data 301 may be divided into frames of, for example, 10 milliseconds each. To come closer to the way the human ear processes voice data, adjacent frames may overlap: for example, the first frame may be the 0th to 10th millisecond of the voice data, and the second frame the 8th to 18th millisecond.
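The overlapping framing just described (10 ms frames starting 8 ms apart, giving a 2 ms overlap) can be sketched as follows; the 16 kHz sample rate, which makes these durations 160 and 128 samples, is an illustrative assumption.

```python
# Sketch: slice a 1-D signal into overlapping frames (drop the ragged tail).
import numpy as np

def split_frames(signal, frame_len=160, hop=128):
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])

# 1 second of audio at 16 kHz:
signal = np.arange(16000, dtype=float)
frames = split_frames(signal)
print(frames.shape)          # (124, 160)
print(frames[1][0])          # 128.0 — frame 2 starts at the 8 ms mark
```

Each row of the result is then passed through the per-frame feature extraction to build the audio feature set.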
Each frame of voice data can then be processed to obtain the audio feature data of that frame (step 1 in Fig. 3). For example, the FBANK model described above can be used to obtain the FBANK audio feature data of each frame. Each frame's audio feature data may be an array of dimension L, where L is greater than or equal to 1; optionally, L equals 13. The audio feature set 302 may include the audio feature data of multiple consecutive frames.
With continued reference to Fig. 3, each obtained frame of audio feature data can be cached according to a predetermined caching rule (step 2 in Fig. 3). For example, each frame of audio feature data can be sequentially written into the cache 303. The predetermined caching rule includes, but is not limited to: caching the audio feature data of a predetermined number of consecutive frames according to a first-in-first-out rule; or caching the audio feature data of a predetermined number of consecutive frames after a predetermined phoneme label is detected. Optionally, the size of the cache 303 can be just large enough to cover what is needed to recognize the voice wake-up keyword. For example, assuming that recognizing the voice wake-up keyword "Doraemon" requires the audio feature data of about M frames, the size of the cache 303 can be M*L bits.
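The first-in-first-out caching rule for cache 303 can be sketched with a bounded queue; the values of M and L below are illustrative:

```python
from collections import deque

M, L = 100, 13  # illustrative: cache holds M frames of L-dimensional features

# A deque with maxlen implements the first-in-first-out rule: once M frames
# are cached, appending a new frame silently evicts the oldest one.
feature_cache = deque(maxlen=M)

for frame_index in range(150):                 # 150 incoming frames
    frame_features = [float(frame_index)] * L  # stand-in for FBANK features
    feature_cache.append(frame_features)

# The cache now holds exactly the most recent M frames (frames 50..149).
```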
The cache 303 can sequentially input the first through N-th frames of audio feature data into the keyword detection network 304 (step 3 in Fig. 3). To obtain more accurate results, the keyword detection network 304 can be a complex deep neural network. Specifically, as shown in Fig. 3, the keyword detection network 304 may include one or more hidden layers. Each hidden layer includes multiple neurons (cells), and each neuron can have multiple inputs. For example, the input to a neuron in the hidden layer closest to the input layer can be any dimension of the L-dimensional audio feature data. Each neuron has a weight and a bias for each input, and the values of the weights and biases are obtained by training on a large amount of sample data. The keyword detection network 304 in Fig. 3 is only an example and can also have other structures; the disclosure does not limit the structure of the keyword detection network 304, the number of nodes in each layer, or the connections between nodes.
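The per-frame computation of such a network (weighted sums plus biases through hidden layers, with a softmax over phoneme labels at the output) can be sketched as below. The layer sizes and random weights are illustrative assumptions; in the method above, the weights and biases come from training on a large sample set:

```python
import math
import random

random.seed(0)

def layer(n_in, n_out):
    # Each neuron has one weight per input plus a bias (random stand-ins here).
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    # ReLU hidden layers, softmax output over the phoneme labels.
    for depth, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

L, HIDDEN, N_LABELS = 13, 32, 8        # illustrative sizes
net = [layer(L, HIDDEN), layer(HIDDEN, HIDDEN), layer(HIDDEN, N_LABELS)]

frame = [0.5] * L                      # one frame of L-dimensional features
posteriors = forward(frame, net)       # per-label association probabilities for this frame
```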
The above keyword detection network 304 compares each frame of audio feature data in the cache with the phoneme labels of the voice wake-up keyword to determine the association probability between that frame of audio feature data and each phoneme label. The keyword detection network 304 can process the audio feature data of one frame at a time, or of multiple frames at a time. Taking one frame at a time as an example, the keyword detection network 304 can calculate the association probability P_ij between the j-th frame of audio feature data and the i-th phoneme label, where i and j are integers greater than or equal to 0. For example, when processing the first frame of audio feature data of the voice wake-up keyword "Doraemon", the keyword detection network 304 can compare the first phoneme label "x" of the voice wake-up keyword with the first frame of audio feature data and output the probability P_11 that the first phoneme label "x" is associated with the first frame of audio feature data.
Since the association probability P_ij usually contains noise, it can be smoothed with a smoothing window before the confidence of the voice wake-up keyword is calculated. For example, the following formula (1) can be used to process the association probability P_ij and obtain the smoothed association probability P'_ij:

P'_ij = (1 / (j - h_smooth + 1)) * Σ_{k = h_smooth}^{j} P_ik    (1)

In formula (1), k takes the values between h_smooth and j, and h_smooth denotes the index/frame number of the first frame in the smoothing window. h_smooth can be calculated with the following formula (2):
h_smooth = max{1, j - w_smooth + 1}    (2)
The above w_smooth refers to the size of the smoothing window. For example, when the size of the smoothing window is 6 frames, j = 10, and i = 9, the smoothed association probability P'_ij is the average of the probabilities that the audio feature data of the 5th through 10th frames are each associated with the 9th phoneme label; in this case h_smooth equals 5. The smoothing reduces the noise in the per-frame keyword probabilities across consecutive frames, making the confidence more accurate.
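Formulas (1) and (2) can be sketched directly; the example below reproduces the numbers from the text (a 6-frame window, j = 10, i = 9), assuming frames are numbered from 1 as formula (2) implies, with illustrative posterior values:

```python
def smoothed_posterior(P, i, j, w_smooth):
    # Formula (2): first frame inside the smoothing window (frames numbered from 1).
    h_smooth = max(1, j - w_smooth + 1)
    # Formula (1): average of P[i][k] for k = h_smooth .. j.
    window = [P[i][k] for k in range(h_smooth, j + 1)]
    return sum(window) / (j - h_smooth + 1)

# Illustrative posteriors: P[i][k] = probability that frame k matches phoneme
# label i, stored in a dict so frames can be numbered from 1 as in the text.
P = {9: {k: 0.1 * k for k in range(1, 11)}}

# 6-frame window, j = 10, i = 9: the average of P[9][5..10], with h_smooth = 5.
p_smooth = smoothed_posterior(P, i=9, j=10, w_smooth=6)
```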
Then, the keyword detection network 304 can sequentially input the smoothed association probabilities P'_ij one by one into the confidence calculation window 305 (step 4 in Fig. 3), without having to compute all the association probabilities P'_ij at once. The keyword detection network 304 can calculate the confidence that the voice wake-up word is detected at the j-th frame. Assume the window size of the confidence calculation window 305 is w_max, where w_max is greater than 1. Specifically, the confidence calculation window 305 can use the following formula (3):

confidence = ( Π_{i=1}^{n} max_{h_max ≤ m ≤ j} P'_im )^{1/n}    (3)

to calculate the confidence that the voice wake-up keyword is detected in the audio feature set 302. In formula (3), n denotes the index of the phoneme label currently being calculated. For example, assuming the voice wake-up keyword has 30 phoneme labels and the 25th phoneme label is currently being processed, then n equals 25. m takes the values between h_max and j. h_max denotes the index/frame number of the first frame in the confidence calculation window and can be obtained with the following formula (4):

h_max = max{1, j - w_max + 1}    (4)
According to formulas (1) through (4), the confidence output by the confidence calculation window 305 is usually small for the first few frames, because at that point the data in the audio feature set has not yet been compared with most of the phoneme labels of the voice wake-up keyword. As more audio feature data is compared, the confidence keeps changing. If the voice data contains the voice wake-up keyword, the confidence output by the confidence calculation window 305 tends to increase as more audio feature data is compared. When the confidence reaches a certain threshold, it is determined that the voice wake-up keyword has been detected. For example, in formula (3), assuming there are 30 phoneme labels in total, the confidence may already exceed the threshold when n = 25. In that case, the association probabilities between the 26th through 30th phoneme labels and the audio feature data need not be calculated, and it can be directly judged that the voice wake-up keyword has been detected. If the voice data does not contain the voice wake-up keyword, the confidence output by the confidence calculation window 305 will never reach the threshold, and it will be determined that the voice wake-up keyword has not been detected.
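Formulas (3) and (4), including the early-exit behavior described above (stop once the confidence crosses the threshold before all phoneme labels are processed), can be sketched as follows; the posterior values, window size, and threshold are illustrative assumptions:

```python
def confidence(P_smooth, n, j, w_max):
    # Formula (4): index of the first frame inside the confidence window.
    h_max = max(1, j - w_max + 1)
    # Formula (3): geometric mean, over phoneme labels 1..n, of each label's
    # maximum smoothed posterior within the window.
    product = 1.0
    for i in range(1, n + 1):
        product *= max(P_smooth[i][m] for m in range(h_max, j + 1))
    return product ** (1.0 / n)

def detect(P_smooth, n_labels, j, w_max, threshold):
    # Early exit: once the running confidence crosses the threshold, report a
    # detection without processing the remaining phoneme labels.
    for n in range(1, n_labels + 1):
        if confidence(P_smooth, n, j, w_max) >= threshold:
            return True, n
    return False, n_labels

# Illustrative smoothed posteriors over frames 1..40 for 30 labels: the first
# 10 labels match weakly, the rest strongly; P_noise matches nothing.
P_keyword = {i: {m: (0.3 if i <= 10 else 0.95) for m in range(1, 41)}
             for i in range(1, 31)}
P_noise = {i: {m: 0.2 for m in range(1, 41)} for i in range(1, 31)}

hit, n_used = detect(P_keyword, n_labels=30, j=40, w_max=35, threshold=0.5)
miss, _ = detect(P_noise, n_labels=30, j=40, w_max=35, threshold=0.5)
```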
The above keyword detection network 304 and confidence calculation window 305 can run in parallel to reduce latency.
As described above, after the voice wake-up keyword is detected, the binary classification network 306 can be activated to make a wake-up decision on the audio feature set 302. Specifically, when the confidence is greater than the threshold, the confidence calculation window 305 can send a specific signal to the cache 303 (step 5 in Fig. 3). The cache 303 then sends the audio feature data it has cached to the binary classification network 306 (step 6 in Fig. 3). Specifically, assume that at the j-th frame, the confidence output by the confidence calculation window 305 is greater than the threshold. At this point, the audio feature data of the (j-p)-th through (j+p)-th frames in the cache (p being a natural number greater than or equal to 0) can be combined into representative audio feature data and input into the binary classification network 306. Of course, all the audio feature data in the cache can also be input into the binary classification network 306. The binary classification network 306 can then judge, according to the method described above, whether to wake up the smart device 101.
The binary classification network 306 in Fig. 3 is shown as a fully connected network only as an example; those skilled in the art should understand that the binary classification network 306 can also have other structures, for example including multiple hidden layers, and the disclosure places no restriction on its structure.
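The decision step can be sketched as follows: the cached frames around frame j are averaged into representative audio feature data (one plausible way to combine them) and passed through a single fully connected sigmoid unit whose output is compared against 0.5. The averaging, layer size, weights, and cut-off are all illustrative assumptions, not the disclosure's fixed design:

```python
import math

def representative_feature(cached_frames, j, p):
    # Combine frames j-p .. j+p into one representative feature vector
    # by element-wise averaging.
    window = cached_frames[max(0, j - p): j + p + 1]
    return [sum(col) / len(window) for col in zip(*window)]

def wake_decision(feature, weights, bias):
    # One fully connected sigmoid unit: wake if the score exceeds 0.5.
    z = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

L = 13
cached_frames = [[0.1 * f] * L for f in range(20)]      # stand-in cached features
rep = representative_feature(cached_frames, j=10, p=3)  # frames 7..13

weights, bias = [1.0] * L, -10.0                        # illustrative parameters
decision = wake_decision(rep, weights, bias)
```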
Fig. 4 is another schematic diagram showing the voice wake-up method 200 according to an embodiment of the present disclosure.
Referring to Fig. 4, the voice wake-up method 200 of the disclosure can be implemented by two modules, namely a high wake-up rate module 401 and a low false wake-up rate module 402.
The high wake-up rate module 401 includes an FBANK feature calculation module, a feature cache module, a keyword detection network, and a posterior processing module. The FBANK feature calculation module calculates the FBANK features of the audio input, for example implementing step 1 in Fig. 3. The feature cache module stores the calculated FBANK features, for example implementing step 2 in Fig. 3. The keyword detection network detects the voice wake-up keyword and can be similar to the keyword detection network 304 in Fig. 3. The posterior processing module further processes the association probabilities output by the keyword detection network (since these association probabilities are calculated under given conditions/inputs, they are also called posterior probabilities) and can be similar to the confidence calculation window 305 in Fig. 3.
Specifically, the high wake-up rate module 401 uses the keyword detection network to detect the wake-up word and can achieve a relatively high wake-up rate. To this end, the audio data samples used to train the keyword detection network can be clear, with relatively little noise. Assume the keyword detection network is trained with a first voice data sample set, and the signal-to-noise ratios of the voice data samples in the first voice data sample set are averaged; the result can be called the first average signal-to-noise ratio, which can be relatively high. For example, the first voice data sample set may include samples A in which a user clearly speaks the voice wake-up keyword in a quiet environment, such as voice data samples of a user saying "Doraemon". To clearly distinguish them from samples A, the first voice data sample set may also include samples B in which a user clearly speaks randomly chosen phrases that are not the voice wake-up keyword in a quiet environment, for example voice data samples of a user saying "goodbye", "hello", or "the weather is very good".
A keyword detection network trained with the above first voice data sample set may exhibit a high false wake-up rate when processing input data with a low signal-to-noise ratio. The "false wake-up rate" refers to the probability of identifying voice data that does not contain the voice wake-up keyword as containing it. For example, when processing voice data with heavy music or television noise, the keyword detection network may wrongly identify voice data that does not contain the voice wake-up keyword as containing it, for example mistaking voice data containing "goodbye ding-dong" for voice data containing "Doraemon". Therefore, the low false wake-up rate module 402 can be used to make a wake-up decision on that voice data, thereby reducing the false wake-up rate of the voice wake-up method 200.
The low false wake-up rate module 402 includes a binary classification network and a threshold judgment module. The binary classification network is similar to the binary classification network 306 in Fig. 3. The threshold judgment module determines, based on the output of the binary classification network, whether to wake up the intelligent electronic device.
The low false wake-up rate module 402 uses the binary classification network to make the wake-up decision on the above voice data, thereby achieving a low false wake-up rate. The binary classification network is trained with a second voice data sample set, which has a second average signal-to-noise ratio. The second average signal-to-noise ratio is lower than the first average signal-to-noise ratio. For example, the data samples in the second voice data sample set can be voice data samples synthesized from the samples in the first voice data sample set and various noise data. The noise data can be strong noise data, or real music, television background sound, and the like. The second voice data sample set may also include samples A' in which a user speaks the voice wake-up keyword in a noisy environment, and of course samples B' in which a user speaks random non-wake-up phrases in a noisy environment.
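Synthesizing second-set samples by mixing clean first-set samples with noise at a chosen (lower) signal-to-noise ratio can be sketched as follows; the power-ratio-in-dB definition of SNR and the 5 dB target are illustrative assumptions:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_scaled_noise) == snr_db,
    # then add it to the clean speech sample by sample.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measure_snr_db(speech, noise):
    p_s = sum(s * s for s in speech) / len(speech)
    p_n = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_s / p_n)

clean = [math.sin(0.1 * t) for t in range(1000)]         # stand-in clean sample
noise = [math.sin(0.37 * t + 1.0) for t in range(1000)]  # stand-in noise
noisy = mix_at_snr(clean, noise, snr_db=5.0)             # second-set sample at 5 dB
```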
After the keyword detection network has been trained, the second voice data sample set, pre-labeled as to whether each sample contains the voice wake-up keyword, can be input into the keyword detection network. According to the output of the keyword detection network, the voice data samples in the second voice data sample set are classified into positive-sample voice data and negative-sample voice data. Positive-sample voice data is voice data correctly identified by the keyword detection network, and negative-sample voice data is voice data misidentified by the keyword detection network. The binary classification network is trained with the positive-sample voice data and the negative-sample voice data.
After training, the binary classification network can optimize the output of the keyword detection network, that is, judge the correctness of the keyword detection network's results, thereby effectively suppressing most false wake-ups while maintaining a high wake-up rate. Moreover, since the binary classification network is a lightweight neural network, it does not introduce excessive system overhead, and the wake-up performance is clearly improved without affecting system performance.
Fig. 5 is a schematic diagram showing a voice wake-up apparatus 500 according to an embodiment of the present disclosure.
The voice wake-up apparatus 500 according to an embodiment of the present disclosure includes a voice data extraction module 501, a first processing module 502, and a second processing module 503. The voice data extraction module 501 obtains the audio feature set of the voice data. The first processing module 502 detects the voice wake-up keyword based on the audio feature set. The second processing module 503 makes a wake-up decision on the audio feature set using a binary classification network in the case where the voice wake-up keyword is detected.
The voice wake-up apparatus 500 further includes a wake-up module 504 for waking the intelligent electronic device in the case where the wake-up decision for the voice data is to wake up.
The binary classification network in the voice wake-up apparatus 500 includes a fully connected neural network. Making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes: in the case where the voice wake-up keyword is detected, activating the fully connected neural network, combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
The voice wake-up apparatus 500 performs voice wake-up keyword detection and the wake-up decision respectively with the cascaded first processing module 502 and second processing module 503. Compared with common voice wake-up technologies, it can achieve a higher wake-up rate and significantly reduce false wake-ups.
Specifically, the first processing module 502 can use the above keyword detection network to detect the voice wake-up keyword. The keyword detection network uses an acoustic model, the association probabilities between it and the input voice data (also called acoustic model posterior probabilities), and the confidence calculation to make the wake-up decision on the input voice.
Optionally, while calculating the acoustic model posterior probabilities and the confidence, the keyword detection network can cache the audio feature data of a fixed window size. When the calculated confidence reaches a specific threshold, the voice wake-up keyword is confirmed as detected. The first processing module 502 can then send the cached fixed-window audio feature data to the second processing module 503.
After receiving the audio feature data sent by the first processing module 502, the second processing module 503 can use the binary classification network to make the wake-up decision.
As described above, the binary classification network in the second processing module 503 can be trained with sample data to which a large amount of music, television, and other noise data has been added. Since the above binary classification network is a lightweight network, it can significantly improve the system's false wake-up performance without bringing excessive additional overhead to the system.
Fig. 6 is a structural diagram showing an intelligent electronic device 600 according to an embodiment of the present disclosure.
Referring to Fig. 6, the intelligent electronic device 600 may include a processor 601, a memory 602, and a voice acquisition unit 604, which can be connected by a bus 603. The intelligent electronic device 600 can be a smart speaker, a smart television, a smart set-top box, a smart phone, or the like.
The processor 601 can execute various actions and processing according to programs stored in the memory 602. Specifically, the processor 601 can be an integrated circuit chip with signal processing capability. The above processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor, or the processor can be any conventional processor, and can be of x86 or ARM architecture.
The memory 602 stores computer instructions that, when executed by the processor 601, implement the above voice wake-up method 200. The memory 602 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
The voice acquisition unit 604 can be a transducer that converts a voice signal into an electrical signal, such as a microphone. The voice acquisition unit 604 can perform the acoustic-to-electric conversion in a variety of forms: electrodynamic (moving coil, ribbon), condenser (DC-polarized), piezoelectric (crystal, ceramic), electromagnetic, carbon, semiconductor, and so on. The electrical signal acquired by the voice acquisition unit can be stored in the memory 602 as a digital file.
The disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the voice wake-up method 200. Similarly, the computer-readable storage medium in the embodiments of the present disclosure can be volatile memory or nonvolatile memory, or can include both. It should be noted that the computer-readable storage media described herein are intended to include, but are not limited to, these and any other suitable types of memory.
The voice wake-up method, apparatus, computer-readable storage medium, and intelligent electronic device of the embodiments of the present disclosure can solve technical problems in current voice wake-up technology such as heavy computation, high latency, and slow response, and improve the usability of voice wake-up technology.
It should be noted that the flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the disclosure. In this regard, each box in a flowchart or block diagram can represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that shown in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
In general, the various example embodiments of the disclosure can be implemented in hardware or special-purpose circuits, software, firmware, logic, or any combination thereof. Some aspects can be implemented in hardware, while other aspects can be implemented in firmware or software executable by a controller, microprocessor, or other computing device. When aspects of embodiments of the disclosure are illustrated or described as block diagrams, flowcharts, or some other graphical representation, it will be understood that the boxes, apparatus, systems, techniques, or methods described herein can be implemented, as non-limiting examples, in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
The example embodiments of the present invention described in detail above are illustrative only and not restrictive. Those skilled in the art should understand that, without departing from the principles and spirit of the present invention, various modifications and combinations can be made to these embodiments or their features, and such modifications shall fall within the scope of the present invention.
Claims (15)
1. A voice wake-up method based on artificial intelligence, comprising:
obtaining an audio feature set of voice data;
detecting a voice wake-up keyword based on the audio feature set; and
in a case where the voice wake-up keyword is detected, making a wake-up decision on the audio feature set using a binary classification network.
2. The voice wake-up method based on artificial intelligence of claim 1, wherein the binary classification network includes a fully connected neural network,
wherein making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes:
in the case where the voice wake-up keyword is detected, activating the fully connected neural network; and
combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
3. The voice wake-up method based on artificial intelligence of claim 1, wherein obtaining the audio feature set of the voice data includes:
obtaining audio feature data of each frame of the voice data; and
caching each obtained frame of audio feature data according to a predetermined caching rule,
wherein the audio feature set includes the audio feature data of multiple consecutive frames.
4. The voice wake-up method based on artificial intelligence of claim 3, wherein the voice wake-up keyword includes multiple phoneme labels, and detecting the voice wake-up keyword based on the audio feature set includes:
using a keyword detection network, comparing each frame of audio feature data in the cache with the phoneme labels of the voice wake-up keyword to determine an association probability between that frame of audio feature data and each phoneme label; and
determining, according to the association probability, a confidence that the voice wake-up keyword is detected in the audio feature set.
5. The voice wake-up method based on artificial intelligence of claim 3, wherein the predetermined caching rule includes at least one of the following:
caching the audio feature data of a predetermined number of consecutive frames according to a first-in-first-out rule; and
caching the audio feature data of a predetermined number of consecutive frames after a predetermined phoneme label is detected.
6. The voice wake-up method based on artificial intelligence of claim 4, wherein the binary classification network is trained after the training of the keyword detection network is completed.
7. The voice wake-up method based on artificial intelligence of claim 6, wherein:
the keyword detection network is trained with a first voice data sample set;
wherein the first voice data sample set has a first average signal-to-noise ratio, and at least a part of the first voice data sample set is voice data containing the voice wake-up keyword.
8. The voice wake-up method based on artificial intelligence of claim 7, wherein:
the binary classification network is trained with a second voice data sample set;
wherein the second voice data sample set has a second average signal-to-noise ratio, and the first average signal-to-noise ratio is higher than the second average signal-to-noise ratio.
9. The voice wake-up method based on artificial intelligence of claim 1, wherein:
in a case where the wake-up decision for the voice data is to wake up, an intelligent electronic device is woken up.
10. A voice wake-up apparatus, comprising:
a voice data extraction module for obtaining an audio feature set of voice data;
a first processing module for detecting a voice wake-up keyword based on the audio feature set; and
a second processing module for making a wake-up decision on the audio feature set using a binary classification network in a case where the voice wake-up keyword is detected.
11. The voice wake-up apparatus of claim 10, further comprising:
a wake-up module for waking an intelligent electronic device in a case where the wake-up decision for the voice data is to wake up.
12. The voice wake-up apparatus of claim 10, wherein the binary classification network includes a fully connected neural network,
wherein making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes:
in the case where the voice wake-up keyword is detected, activating the fully connected neural network; and
combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
13. An intelligent electronic device, comprising:
a voice acquisition unit for acquiring voice data;
a processor; and
a memory storing computer instructions that, when executed by the processor, implement the method of any one of claims 1-9.
14. The intelligent electronic device of claim 13, wherein the intelligent electronic device is a smart speaker, a smart television, a smart set-top box, or a smart phone.
15. A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910747867.6A CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910747867.6A CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110364143A true CN110364143A (en) | 2019-10-22 |
CN110364143B CN110364143B (en) | 2022-01-28 |
Family
ID=68224739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910747867.6A Active CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110364143B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI489372B (en) * | 2013-04-10 | 2015-06-21 | Via Tech Inc | Voice control method and mobile terminal apparatus |
KR101794884B1 (en) * | 2016-03-15 | 2017-12-01 | 재단법인대구경북과학기술원 | Apparatus for drowsy driving prevention using voice recognition load and method thereof |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | Voice awakening method, system and intelligent terminal |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Voice keyword recognition method and device |
CN109461448A (en) * | 2018-12-11 | 2019-03-12 | 百度在线网络技术(北京)有限公司 | Voice interaction method and device |
CN109599124A (en) * | 2018-11-23 | 2019-04-09 | 腾讯科技(深圳)有限公司 | Audio data processing method, device and storage medium |
CN110033758A (en) * | 2019-04-24 | 2019-07-19 | 武汉水象电子科技有限公司 | Voice wake-up implementation method based on a decoding network optimized with a small training set |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110706707A (en) * | 2019-11-13 | 2020-01-17 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer-readable storage medium for voice interaction |
US11393490B2 (en) | 2019-11-13 | 2022-07-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and computer-readable storage medium for voice interaction |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN110838289B (en) * | 2019-11-14 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN112885339A (en) * | 2019-11-14 | 2021-06-01 | 杭州智芯科微电子科技有限公司 | Voice awakening system and voice recognition system |
CN110808030B (en) * | 2019-11-22 | 2021-01-22 | 珠海格力电器股份有限公司 | Voice awakening method, system, storage medium and electronic equipment |
CN110808030A (en) * | 2019-11-22 | 2020-02-18 | 珠海格力电器股份有限公司 | Voice awakening method, system, storage medium and electronic equipment |
CN113192499A (en) * | 2020-01-10 | 2021-07-30 | 青岛海信移动通信技术股份有限公司 | Voice awakening method and terminal |
CN111192590A (en) * | 2020-01-21 | 2020-05-22 | 苏州思必驰信息科技有限公司 | Voice wake-up method, device, equipment and storage medium |
CN111739521A (en) * | 2020-06-19 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Electronic equipment awakening method and device, electronic equipment and storage medium |
CN111653276A (en) * | 2020-06-22 | 2020-09-11 | 四川长虹电器股份有限公司 | Voice awakening system and method |
CN111816193B (en) * | 2020-08-12 | 2020-12-15 | 深圳市友杰智新科技有限公司 | Voice awakening method and device based on multi-segment network and storage medium |
CN111816193A (en) * | 2020-08-12 | 2020-10-23 | 深圳市友杰智新科技有限公司 | Voice awakening method and device based on multi-segment network and storage medium |
CN112233656A (en) * | 2020-10-09 | 2021-01-15 | 安徽讯呼信息科技有限公司 | Artificial intelligent voice awakening method |
CN111933114B (en) * | 2020-10-09 | 2021-02-02 | 深圳市友杰智新科技有限公司 | Training method and use method of voice awakening hybrid model and related equipment |
CN111933114A (en) * | 2020-10-09 | 2020-11-13 | 深圳市友杰智新科技有限公司 | Training method and use method of voice awakening hybrid model and related equipment |
CN112509568A (en) * | 2020-11-26 | 2021-03-16 | 北京华捷艾米科技有限公司 | Voice awakening method and device |
TWI767532B (en) * | 2021-01-22 | 2022-06-11 | 賽微科技股份有限公司 | A wake word recognition training system and training method thereof |
WO2022206602A1 (en) * | 2021-03-31 | 2022-10-06 | 华为技术有限公司 | Speech wakeup method and apparatus, and storage medium and system |
CN115312049A (en) * | 2022-06-30 | 2022-11-08 | 青岛海尔科技有限公司 | Command response method, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110364143B (en) | 2022-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110364143A (en) | Voice awakening method, device and its intelligent electronic device | |
US11908455B2 (en) | Speech separation model training method and apparatus, storage medium and computer device | |
CN108198547B (en) | Voice endpoint detection method and device, computer equipment and storage medium | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
US8930196B2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
TW201935464A (en) | Method and device for voiceprint recognition based on memorability bottleneck features | |
CN110534099A (en) | Voice wake-up processing method, device, storage medium and electronic equipment |
CN110570873B (en) | Voiceprint wake-up method and device, computer equipment and storage medium | |
CN107492382A (en) | Voiceprint extraction method and device based on neural network |
CN107767863A (en) | Voice awakening method, system and intelligent terminal |
CN107799126A (en) | Voice endpoint detection method and device based on supervised machine learning |
CN103065629A (en) | Speech recognition system of humanoid robot | |
CN108364662B (en) | Voice emotion recognition method and system based on paired identification tasks | |
CN102800316A (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
WO2023030235A1 (en) | Target audio output method and system, readable storage medium, and electronic apparatus | |
Mistry et al. | Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann) | |
US11611581B2 (en) | Methods and devices for detecting a spoofing attack | |
CN109215634A (en) | Method and system for a multi-word voice-controlled switch system |
Wen et al. | The application of capsule neural network based cnn for speech emotion recognition | |
CN110853669A (en) | Audio identification method, device and equipment | |
CN110728993A (en) | Voice change identification method and electronic equipment | |
US11475876B2 (en) | Semantic recognition method and semantic recognition device | |
Sahoo et al. | MFCC feature with optimized frequency range: An essential step for emotion recognition | |
Brucal et al. | Female voice recognition using artificial neural networks and MATLAB voicebox toolbox |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||