CN109036412A - Voice wake-up method and system - Google Patents
Voice wake-up method and system
- Publication number
- CN109036412A CN109036412A CN201811081600.XA CN201811081600A CN109036412A CN 109036412 A CN109036412 A CN 109036412A CN 201811081600 A CN201811081600 A CN 201811081600A CN 109036412 A CN109036412 A CN 109036412A
- Authority
- CN
- China
- Prior art keywords
- data
- voice
- acoustic feature
- feature information
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The present invention relates to a voice wake-up method and system. The method comprises the following steps: performing framing and windowing on raw voice data to obtain the corresponding speech frames and extracting the acoustic feature information of those frames; computing on the acoustic feature information to obtain a deep neural network classification model; recording live voice data, extracting its corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain posterior probability information; and comparing the resulting confidence against a preset threshold, waking the voice recording device when the confidence exceeds the threshold. The method effectively improves wake-up performance in noisy scenes; simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
Description
Technical field
The present invention relates to the field of speech recognition, and more particularly to a voice wake-up method and system.
Background art
Voice wake-up technology is an important branch of the speech recognition field, widely used in voice interaction systems such as mobile phones, smart home devices, and in-vehicle navigation, allowing users to wake a device conveniently with a spoken command. More specifically, the task of a voice wake-up system is to automatically detect a predefined wake-up word in the speech it continuously receives in the background, a task generally also referred to as keyword spotting (KWS); when the system detects the keyword, the device wakes up and enters a specific working state.
Currently, the performance of a voice wake-up system is mainly evaluated with two metrics. One is the false reject rate (FRR), the probability that the system misses a wake-up word; the other is the false alarm rate (FAR), the probability that the system misrecognizes a non-wake-up word as the wake-up word, also called the false wake-up rate. The false wake-up rate can also be measured with a separate index, namely the number of false wake-ups within a period of time, e.g. 1 per 12 hours. In theory, the false reject rate and the false alarm rate are conflicting metrics: lowering the false reject rate tends to raise the false alarm rate and, conversely, lowering the false alarm rate tends to raise the false reject rate.
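As a minimal illustration of how these two metrics relate to raw evaluation counts (the counts below are hypothetical, not taken from the patent):

```python
def frr_far(missed, total_positive, false_alarms, total_negative):
    """False reject rate = missed wake-words / wake-word utterances;
    false alarm rate = false wake-ups / non-wake-word utterances."""
    frr = missed / total_positive
    far = false_alarms / total_negative
    return frr, far

# Hypothetical test set: 100 wake-word utterances, 1000 negatives.
frr, far = frr_far(missed=5, total_positive=100,
                   false_alarms=2, total_negative=1000)
print(frr, far)  # 0.05 0.002
```

Tightening the decision threshold moves these two numbers in opposite directions, which is exactly the trade-off described above.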
A well-performing voice wake-up system should have both a low false reject rate and a low false alarm rate. In fields such as the smart home in particular, an excessive false alarm rate disturbs the user's normal communication, leisure, or entertainment to some extent and invites the user's dislike; on the other hand, under complex conditions such as the common far-field and noisy scenes, an excessive false reject rate greatly degrades the actual experience of using an intelligent voice device. How to minimize the false reject rate in various complex scenarios while keeping the false alarm rate low, and how to improve the wake-up system's robustness to variations in a speaker's speaking rate and accent, is a problem in urgent need of a solution.
Summary of the invention
Based on this, in view of the above problems of minimizing the false reject rate in various complex scenarios under the premise of a controlled, low false alarm rate, and of improving the wake-up system's robustness to variations in a speaker's speaking rate and accent, it is necessary to provide a voice wake-up method and system.
A voice wake-up method comprises the following steps:
recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data;
performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames;
computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
recording live voice data, extracting the corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain the posterior probability information of the live voice data;
calculating the confidence of the recorded live voice data according to the posterior probability information and comparing the confidence against a preset threshold; when the confidence exceeds the threshold, waking the voice recording device, and when the confidence is below the threshold, not waking the device and further awaiting user instructions.
In a preferred embodiment, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data includes one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
In a preferred embodiment, after the step of performing framing and windowing on the raw voice data and the simulated voice data to obtain the corresponding speech frames and extracting the acoustic feature information of the speech frames, the method further includes:
performing denoising on the acoustic feature information of the speech frames.
In a preferred embodiment, the step of recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data further includes:
performing denoising on the acoustic feature information corresponding to the live data.
The voice wake-up method of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
A voice wake-up system comprises:
a voice data simulation module, configured to record original audio data and acquire the environmental audio data corresponding to the scene in which the voice recording device is used, and to convert the original audio data into environment speech simulation data according to the environmental audio data;
a feature extraction module, configured to perform framing and windowing on the raw voice data and/or simulated voice data to obtain the corresponding speech frames and to extract the acoustic feature information of the speech frames;
a deep neural network module, configured to compute on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
a wake-up decision module, configured to record live voice data, extract the corresponding acoustic feature information, input it into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate the confidence of the recorded live voice data according to the posterior probability information, and compare the confidence against a preset threshold, waking the voice recording device when the confidence exceeds the threshold.
In a preferred embodiment, in recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data includes one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
In a preferred embodiment, the system further includes:
a denoising autoencoder module, configured to perform denoising on the acoustic feature information of the speech frames.
In a preferred embodiment, the wake-up decision module further includes:
a denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data.
The voice wake-up system of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
Brief description of the drawings
Fig. 1 is a flowchart of a voice wake-up method according to a preferred embodiment of the invention;
Fig. 2 is a module diagram of a voice wake-up system according to a preferred embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
It should be noted that when an element is referred to as being "disposed on" another element, it can be directly on the other element, or intervening elements may be present. When an element is considered to be "connected to" another element, it can be directly connected to the other element, or intervening elements may be present at the same time. The terms "vertical", "horizontal", "left", "right", and similar expressions used herein are for illustrative purposes only and do not denote the only embodiment.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of the invention. The terms used in the specification of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
As shown in Fig. 1, a voice wake-up method according to a preferred embodiment of the invention includes the following steps:
S10: recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data.
In this step, the operator records original, clean audio data on the voice recording device, simulates environmental factors of the scene in which the voice recording device is located, such as noise, speaking rate, reverberation, pitch, and loudness, and thereby converts the original audio data into environment speech simulation data.
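The simulation of S10 can be pictured as simple waveform operations. The following is a minimal numpy sketch under assumed parameters (the SNR, speed factor, and gain values are illustrative, and reverberation simulation is omitted for brevity; the patent does not prescribe a specific augmentation algorithm):

```python
import numpy as np

def simulate_environment(speech, noise, snr_db=10.0, speed=1.0, gain=1.0):
    """Convert clean audio into 'environment speech simulation data':
    a speaking-rate change (naive linear resampling), additive noise
    at a target SNR, and a loudness change."""
    # Speaking-rate simulation: resample the waveform by linear interpolation.
    n_out = int(len(speech) / speed)
    idx = np.linspace(0, len(speech) - 1, n_out)
    sped = np.interp(idx, np.arange(len(speech)), speech)
    # Noise simulation: scale the noise track to the requested SNR.
    noise = np.resize(noise, len(sped))
    p_speech = np.mean(sped ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise_scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    # Loudness simulation: apply an overall gain.
    return gain * (sped + noise_scale * noise)
```

Applying several such transforms with different parameters to each clean recording yields the multiple simulated variants that the later training steps consume.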
S20: performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames.
The raw voice data and/or simulated voice data are split into their corresponding speech frames by framing and windowing, and feature extraction is then performed on each speech frame. In this embodiment, the speech feature may be filter-bank (fbank) features or any other speech feature; the invention places no limit on this.
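The framing, windowing, and fbank extraction of S20 might be sketched as follows. The frame length, hop, window type, and mel settings (25 ms frames, 10 ms hop at 16 kHz, Hamming window, 40 mel bands) are common defaults assumed here, not values fixed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and apply a Hamming
    window (25 ms frames / 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def log_fbank(frames, sr=16000, n_mels=40, n_fft=512):
    """Log mel filter-bank (fbank) energies for each windowed frame."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filter bank.
    mel = lambda f: 2595.0 * np.log10(1 + f / 700.0)
    imel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return np.log(spec @ fb.T + 1e-10)
```

The resulting per-frame feature matrix is what the denoising and classification steps below operate on.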
S30: performing denoising on the acoustic feature information of the speech frames.
In this step, denoising is applied to the acoustic feature information of the speech frames. Specifically, pairs of noise-simulated speech features and their corresponding raw speech features can be used to train a denoising autoencoder. This embodiment constructs the denoising autoencoder from a fully connected neural network; depending on the computing capability of the system, a network with 2-3 hidden layers of 256 or 512 nodes each is typically used, trained by stochastic gradient descent under the mean-square-error (MSE) minimization criterion.
S40: computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames.
First, for the acoustic feature information of the speech frames, a large-vocabulary continuous speech recognition system is used to generate forced-alignment information (at the phoneme or syllable level) for the original audio data and the corresponding environment speech simulation data, and the phonemes or syllables unrelated to the wake-up word are uniformly labeled as filler. In this embodiment, the acoustic features of the speech frames are fed into a convolutional neural network voice wake-up model, which is trained on a large amount of data by stochastic gradient descent under the cross-entropy criterion; the final optimization yields the deep neural network classification model corresponding to the acoustic feature information of the speech frames.
Besides the convolutional neural network, the deep network classification model may also be a fully connected neural network, a time-delay neural network, or the like.
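The classification model of S40 can be illustrated with a deliberately simplified stand-in: a single-hidden-layer softmax network trained with cross-entropy and gradient descent, in place of the convolutional network the text describes. The class labels, dimensions, and hyperparameters here are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class KeywordClassifier:
    """Frame classifier over fbank features, mapping each frame to
    {wake-word units, filler}; a simplified stand-in for the
    convolutional wake-up model described in the text."""
    def __init__(self, dim, n_classes, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (dim, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, n_classes)); self.b2 = np.zeros(n_classes)

    def posteriors(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        return softmax(self.h @ self.W2 + self.b2)

    def sgd_step(self, X, y, lr=0.1):
        """One gradient step under the cross-entropy criterion."""
        p = self.posteriors(X)
        p[np.arange(len(y)), y] -= 1          # d(CE)/d(logits)
        p /= len(y)
        dh = (p @ self.W2.T) * (1 - self.h ** 2)
        self.W2 -= lr * self.h.T @ p
        self.b2 -= lr * p.sum(0)
        self.W1 -= lr * X.T @ dh
        self.b1 -= lr * dh.sum(0)
```

The per-frame posteriors this model emits are the inputs to the confidence computation of the later steps.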
S50: recording live voice data, extracting its corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain the posterior probability information of the live voice data.
In this step, live voice data is recorded; it may be test speech or real speech data, and the invention places no limit on this. Acoustic feature information is extracted from the recorded live voice data and fed into the deep neural network classification model of the previous step, yielding the posterior probability information of the wake-up word class and non-wake-up word class contained in the acoustic feature information of the live voice data.
This step may also include denoising of the recorded live voice data, handled in the same way as the denoising of the acoustic feature information in step S30, which is not repeated here.
S60: calculating the confidence of the recorded live voice data in the deep network classification model according to the posterior probability information, comparing it against a preset threshold, and waking the voice recording device when the confidence exceeds the threshold.
In this step, the confidence of the recorded live voice data is derived from the distribution of the posterior probability information of the previous step and compared with the preset threshold to obtain a decision: when the wake-up word confidence exceeds the preset threshold, the voice device is woken; otherwise, when the confidence is below the threshold, the voice recording device is not woken and further user instructions are awaited.
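The confidence computation and threshold decision of S60 might be sketched as follows. Since the patent does not specify the confidence formula, the sliding-window posterior smoothing and geometric-mean combination used here are one common scheme, and the window length and threshold are assumed values:

```python
import numpy as np

def wake_decision(posteriors, keyword_ids, threshold=0.8, win=10):
    """Smooth per-frame posteriors over a sliding window, take each
    keyword unit's best smoothed score, combine the scores into one
    confidence (geometric mean), and compare against the preset
    threshold."""
    n = len(posteriors)
    conf = 1.0
    for k in keyword_ids:
        best = 0.0
        for t in range(n):
            lo = max(0, t - win + 1)
            best = max(best, posteriors[lo:t + 1, k].mean())
        conf *= best
    conf = conf ** (1.0 / len(keyword_ids))
    return conf > threshold, conf
```

A stream where every keyword unit fires strongly somewhere in the utterance yields a confidence near 1 and triggers the wake-up; background speech keeps all smoothed scores low and leaves the device asleep.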
The voice wake-up method of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
As shown in Fig. 2, another preferred embodiment of the invention discloses a voice wake-up system 100, which includes a voice data simulation module 110, a feature extraction module 120, a deep neural network module 130, and a wake-up decision module 140.
The voice data simulation module 110 is configured to record original audio data, acquire the environmental audio data corresponding to the scene in which the voice recording device is used, and convert the original audio data into environment speech simulation data according to the environmental audio data.
The operator can record original, clean audio data through the voice data simulation module 110, simulate environmental factors of the scene in which the voice recording device is located, such as noise, speaking rate, reverberation, pitch, and loudness, and thereby convert the original audio data into environment speech simulation data.
The feature extraction module 120 is configured to perform framing and windowing on the raw voice data and/or simulated voice data to obtain the corresponding speech frames, and to extract the acoustic feature information of the speech frames.
The feature extraction module 120 splits the raw voice data and/or simulated voice data into their corresponding speech frames by framing and windowing, and extracts the acoustic feature information of each frame. In this embodiment, the speech feature may be filter-bank (fbank) features or any other speech feature; the invention places no limit on this.
The system may also include a denoising autoencoder module 150, configured to perform denoising on the acoustic feature information of the speech frames.
The denoising autoencoder module denoises the acoustic feature information of the speech frames. Specifically, pairs of noise-simulated speech features and their corresponding raw speech features can be used to train a denoising autoencoder. This embodiment constructs the denoising autoencoder from a fully connected neural network; depending on the computing capability of the system, a network with 2-3 hidden layers of 256 or 512 nodes each is typically used, trained by stochastic gradient descent under the mean-square-error (MSE) minimization criterion.
The deep neural network module 130 is configured to compute on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames.
First, for the acoustic feature information of the speech frames, the deep neural network module 130 uses a large-vocabulary continuous speech recognition system to generate forced-alignment information (at the phoneme or syllable level) for the original audio data and the corresponding environment speech simulation data, and uniformly labels the phonemes or syllables unrelated to the wake-up word as filler. In this embodiment, the acoustic features of the speech frames are fed into a convolutional neural network voice wake-up model, which is trained on a large amount of data by stochastic gradient descent under the cross-entropy criterion; the final optimization yields the deep neural network classification model corresponding to the acoustic feature information of the speech frames.
The wake-up decision module 140 is configured to record live voice data, extract the corresponding acoustic feature information, input it into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate the confidence of the recorded live voice data according to the posterior probability information, and compare the confidence against a preset threshold; when the confidence exceeds the threshold, the voice recording device is woken, and when the confidence is below the threshold, the device is not woken and further user instructions are awaited.
The wake-up decision module 140 records live voice data, which may be test speech or real speech data; the invention places no limit on this. It extracts the acoustic feature information of the recorded live voice data and inputs it into the deep neural network classification model described above, obtaining the posterior probability information of the wake-up word class and non-wake-up word class contained in the acoustic feature information of the live voice data.
The wake-up decision module may also include a denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data. The recorded live voice data is denoised in the same way as in the denoising autoencoder module 150, which is not repeated here.
According to the distribution of the posterior probability information of the live voice data, the confidence of the recorded live voice data is obtained and compared with the preset threshold to produce a decision: when the wake-up word confidence exceeds the preset threshold, the voice device is woken; otherwise, the voice device does not respond.
The voice wake-up system of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments merely express several implementations of the invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (8)
1. A voice wake-up method, characterized by comprising the following steps:
recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data;
performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames;
computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data;
calculating the confidence of the recorded live voice data according to the posterior probability information and comparing the confidence against a preset threshold; when the confidence exceeds the threshold, waking the voice recording device, and when the confidence is below the threshold, not waking the device and further awaiting user instructions.
2. The voice wake-up method according to claim 1, characterized in that, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data comprises one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
3. The voice wake-up method according to claim 1, characterized in that, after the step of performing framing and windowing on the raw voice data and the simulated voice data to obtain the corresponding speech frames and extracting the acoustic feature information of the speech frames, the method further comprises:
performing denoising on the acoustic feature information of the speech frames.
4. The voice wake-up method according to claim 1, characterized in that the step of recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data further comprises:
performing denoising on the acoustic feature information corresponding to the live data.
5. A voice wake-up system, comprising:
A voice data simulation module, configured to record original audio data, acquire the environmental audio data corresponding to the scene in which the voice recording device is applied, and convert the original audio data into environmental speech simulation data according to the environmental audio data;
A feature extraction module, configured to perform framing and windowing on the original voice data and/or the simulated voice data to obtain the voice frames corresponding to the original voice and/or the simulated voice, and to extract the acoustic feature information of the voice frames;
A deep neural network module, configured to perform computation on the acoustic feature information to obtain a deep neural network classification model of the voice frames comprising at least a wake-word class and a non-wake-word class;
A wake-up decision module, configured to record live voice data, extract the acoustic feature information corresponding to the live voice data, input the acoustic feature information corresponding to the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate a confidence level of the recorded live voice data according to the posterior probability information, and compare the confidence level with a preset threshold; when the confidence level is greater than the preset threshold, the voice recording device is woken up; when the confidence level is less than the preset threshold, the voice recording device is not woken up and a user instruction is further obtained.
6. The voice wake-up system according to claim 5, wherein, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is applied, and converting the original audio data into environmental speech simulation data according to the environmental audio data, the environmental speech simulation data comprise one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation applied to the original audio data.
7. The voice wake-up system according to claim 5, further comprising:
A denoising auto-encoding module, configured to perform denoising on the acoustic feature information of the voice frames.
8. The voice wake-up system according to claim 5, wherein the wake-up decision module further comprises:
A denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data.
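Claims 7 and 8 add a denoising auto-encoding step on the acoustic features. A minimal forward-pass sketch follows; in practice the weights would be trained on noisy/clean feature pairs, and the bottleneck size and tanh nonlinearity here are assumptions.

```python
import numpy as np

def denoise_features(noisy, w_enc, b_enc, w_dec, b_dec):
    """Denoising auto-encoder forward pass: map noisy acoustic feature
    vectors through a bottleneck and reconstruct cleaned features."""
    hidden = np.tanh(noisy @ w_enc + b_enc)  # encode to bottleneck
    return hidden @ w_dec + b_dec            # decode cleaned features
```

The cleaned features would then be fed to the classification model in place of the raw ones.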
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811081600.XA CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811081600.XA CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109036412A true CN109036412A (en) | 2018-12-18 |
Family
ID=64622013
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811081600.XA Pending CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036412A (en) |
Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105448303A (en) * | 2015-11-27 | 2016-03-30 | 百度在线网络技术(北京)有限公司 | Voice signal processing method and apparatus |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
CN106057192A (en) * | 2016-07-07 | 2016-10-26 | Tcl集团股份有限公司 | Real-time voice conversion method and apparatus |
CN106127217A (en) * | 2015-05-07 | 2016-11-16 | 西门子保健有限责任公司 | The method and system that neutral net detects is goed deep into for anatomical object for approximation |
CN106297779A (en) * | 2016-07-28 | 2017-01-04 | 块互动(北京)科技有限公司 | A kind of background noise removing method based on positional information and device |
CN106328126A (en) * | 2016-10-20 | 2017-01-11 | 北京云知声信息技术有限公司 | Far-field speech recognition processing method and device |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106611598A (en) * | 2016-12-28 | 2017-05-03 | 上海智臻智能网络科技股份有限公司 | VAD dynamic parameter adjusting method and device |
CN106683663A (en) * | 2015-11-06 | 2017-05-17 | 三星电子株式会社 | Neural network training apparatus and method, and speech recognition apparatus and method |
CN106782536A (en) * | 2016-12-26 | 2017-05-31 | 北京云知声信息技术有限公司 | A kind of voice awakening method and device |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN107123417A (en) * | 2017-05-16 | 2017-09-01 | 上海交通大学 | Optimization method and system are waken up based on the customized voice that distinctive is trained |
CN107134279A (en) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium |
CN107945788A (en) * | 2017-11-27 | 2018-04-20 | 桂林电子科技大学 | A kind of relevant Oral English Practice pronunciation error detection of text and quality score method |
CN108242234A (en) * | 2018-01-10 | 2018-07-03 | 腾讯科技(深圳)有限公司 | Speech recognition modeling generation method and its equipment, storage medium, electronic equipment |
CN108320733A (en) * | 2017-12-18 | 2018-07-24 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium, electronic equipment |
CN108335702A (en) * | 2018-02-01 | 2018-07-27 | 福州大学 | A kind of audio defeat method based on deep neural network |
CN108494710A (en) * | 2018-03-30 | 2018-09-04 | 中南民族大学 | Visible light communication MIMO anti-interference noise-reduction methods based on BP neural network |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109886386A (en) * | 2019-01-30 | 2019-06-14 | 北京声智科技有限公司 | Wake up the determination method and device of model |
CN109801629A (en) * | 2019-03-01 | 2019-05-24 | 珠海格力电器股份有限公司 | A kind of sound control method, device, storage medium and air-conditioning |
CN110223708A (en) * | 2019-05-07 | 2019-09-10 | 平安科技(深圳)有限公司 | Sound enhancement method and relevant device based on speech processes |
CN110223708B (en) * | 2019-05-07 | 2023-05-30 | 平安科技(深圳)有限公司 | Speech enhancement method based on speech processing and related equipment |
WO2020228815A1 (en) * | 2019-05-16 | 2020-11-19 | 华为技术有限公司 | Voice-based wakeup method and device |
CN110534102A (en) * | 2019-09-19 | 2019-12-03 | 北京声智科技有限公司 | A kind of voice awakening method, device, equipment and medium |
CN110534102B (en) * | 2019-09-19 | 2020-10-30 | 北京声智科技有限公司 | Voice wake-up method, device, equipment and medium |
CN110767231A (en) * | 2019-09-19 | 2020-02-07 | 平安科技(深圳)有限公司 | Voice control equipment awakening word identification method and device based on time delay neural network |
US11848008B2 (en) | 2019-11-14 | 2023-12-19 | Tencent Technology (Shenzhen) Company Limited | Artificial intelligence-based wakeup word detection method and apparatus, device, and medium |
CN110838289B (en) * | 2019-11-14 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Awakening word detection method, device, equipment and medium based on artificial intelligence |
WO2021093449A1 (en) * | 2019-11-14 | 2021-05-20 | 腾讯科技(深圳)有限公司 | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium |
CN112825250A (en) * | 2019-11-20 | 2021-05-21 | 芋头科技(杭州)有限公司 | Voice wake-up method, apparatus, storage medium and program product |
CN111081217B (en) * | 2019-12-03 | 2021-06-04 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111081217A (en) * | 2019-12-03 | 2020-04-28 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111833869B (en) * | 2020-07-01 | 2022-02-11 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN111833869A (en) * | 2020-07-01 | 2020-10-27 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN112992189B (en) * | 2021-01-29 | 2022-05-03 | 青岛海尔科技有限公司 | Voice audio detection method and device, storage medium and electronic device |
CN112992189A (en) * | 2021-01-29 | 2021-06-18 | 青岛海尔科技有限公司 | Voice audio detection method and device, storage medium and electronic device |
CN113593560A (en) * | 2021-07-29 | 2021-11-02 | 普强时代(珠海横琴)信息技术有限公司 | Customizable low-delay command word recognition method and device |
CN113593560B (en) * | 2021-07-29 | 2024-04-16 | 普强时代(珠海横琴)信息技术有限公司 | Customizable low-delay command word recognition method and device |
CN113782016A (en) * | 2021-08-06 | 2021-12-10 | 佛山市顺德区美的电子科技有限公司 | Wake-up processing method, device, equipment and computer storage medium |
CN113782016B (en) * | 2021-08-06 | 2023-05-05 | 佛山市顺德区美的电子科技有限公司 | Wakeup processing method, wakeup processing device, equipment and computer storage medium |
WO2023029615A1 (en) * | 2021-08-30 | 2023-03-09 | 华为技术有限公司 | Wake-on-voice method and apparatus, device, storage medium, and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109036412A (en) | voice awakening method and system | |
CN106098059B (en) | Customizable voice awakening method and system | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
CN106504768B (en) | Phone testing audio frequency classification method and device based on artificial intelligence | |
CN110970018B (en) | Speech recognition method and device | |
KR20060022156A (en) | Distributed speech recognition system and method | |
CN103377651B (en) | The automatic synthesizer of voice and method | |
CN103456305A (en) | Terminal and speech processing method based on multiple sound collecting units | |
CN104538043A (en) | Real-time emotion reminder for call | |
CN110930976A (en) | Voice generation method and device | |
CN110600008A (en) | Voice wake-up optimization method and system | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
CN112581938B (en) | Speech breakpoint detection method, device and equipment based on artificial intelligence | |
CN109410956A (en) | A kind of object identifying method of audio data, device, equipment and storage medium | |
CN112328994A (en) | Voiceprint data processing method and device, electronic equipment and storage medium | |
CN103811000A (en) | Voice recognition system and voice recognition method | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN105845131A (en) | Far-talking voice recognition method and device | |
CN116705071A (en) | Playback voice detection method based on data enhancement and pre-training model feature extraction | |
CN113099043A (en) | Customer service control method, apparatus and computer-readable storage medium | |
CN115762500A (en) | Voice processing method, device, equipment and storage medium | |
CN115472174A (en) | Sound noise reduction method and device, electronic equipment and storage medium | |
CN114333912A (en) | Voice activation detection method and device, electronic equipment and storage medium | |
CN103533193B (en) | Residual echo elimination method and device | |
CN117636909B (en) | Data processing method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |