CN109036412A - Voice wake-up method and system - Google Patents
Voice wake-up method and system
- Publication number
- CN109036412A CN109036412A CN201811081600.XA CN201811081600A CN109036412A CN 109036412 A CN109036412 A CN 109036412A CN 201811081600 A CN201811081600 A CN 201811081600A CN 109036412 A CN109036412 A CN 109036412A
- Authority
- CN
- China
- Prior art keywords
- data
- voice
- acoustic feature
- feature information
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223—Execution procedure of a spoken command
Abstract
The present invention relates to a voice wake-up method and system. The method comprises the following steps: performing framing and windowing on raw voice data to obtain the corresponding speech frames and extracting the acoustic feature information of those frames; computing on the acoustic feature information to obtain a deep neural network classification model; recording live voice data, extracting its corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain posterior probability information; and comparing the resulting confidence against a preset threshold, waking the voice recording device when the confidence exceeds the threshold. The method effectively improves wake-up performance in noisy scenes; simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
Description
Technical field
The present invention relates to the field of speech recognition, and more particularly to a voice wake-up method and system.
Background art
Voice wake-up technology is an important branch of the speech recognition field, widely used in voice interaction systems such as mobile phones, smart home devices, and in-vehicle navigation, allowing users to wake a device conveniently with a spoken command. More specifically, the task of a voice wake-up system is to automatically detect a predefined wake-up word in the speech it continuously receives in the background, a task generally also referred to as keyword spotting (KWS); when the system detects the keyword, the device wakes up and enters a specific working state.
Currently, the performance of a voice wake-up system is mainly evaluated with two metrics. One is the false reject rate (FRR), the probability that the system misses a wake-up word; the other is the false alarm rate (FAR), the probability that the system misrecognizes a non-wake-up word as the wake-up word, also called the false wake-up rate. The false wake-up rate can also be measured with a separate index, namely the number of false wake-ups within a period of time, e.g. 1 per 12 hours. In theory, the false reject rate and the false alarm rate are conflicting metrics: lowering the false reject rate tends to raise the false alarm rate and, conversely, lowering the false alarm rate tends to raise the false reject rate.
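As a minimal illustration of how these two metrics relate to raw evaluation counts (the counts below are hypothetical, not taken from the patent):

```python
def frr_far(missed, total_positive, false_alarms, total_negative):
    """False reject rate = missed wake-words / wake-word utterances;
    false alarm rate = false wake-ups / non-wake-word utterances."""
    frr = missed / total_positive
    far = false_alarms / total_negative
    return frr, far

# Hypothetical test set: 100 wake-word utterances, 1000 negatives.
frr, far = frr_far(missed=5, total_positive=100,
                   false_alarms=2, total_negative=1000)
print(frr, far)  # 0.05 0.002
```

Tightening the decision threshold moves these two numbers in opposite directions, which is exactly the trade-off described above.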
A well-performing voice wake-up system should have both a low false reject rate and a low false alarm rate. In fields such as the smart home in particular, an excessive false alarm rate disturbs the user's normal communication, leisure, or entertainment to some extent and invites the user's dislike; on the other hand, under complex conditions such as the common far-field and noisy scenes, an excessive false reject rate greatly degrades the actual experience of using an intelligent voice device. How to minimize the false reject rate in various complex scenarios while keeping the false alarm rate low, and how to improve the wake-up system's robustness to variations in a speaker's speaking rate and accent, is a problem in urgent need of a solution.
Summary of the invention
Based on this, in view of the above problems of minimizing the false reject rate in various complex scenarios under the premise of a controlled, low false alarm rate, and of improving the wake-up system's robustness to variations in a speaker's speaking rate and accent, it is necessary to provide a voice wake-up method and system.
A voice wake-up method comprises the following steps:
recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data;
performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames;
computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
recording live voice data, extracting the corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain the posterior probability information of the live voice data;
calculating the confidence of the recorded live voice data according to the posterior probability information and comparing the confidence against a preset threshold; when the confidence exceeds the threshold, waking the voice recording device, and when the confidence is below the threshold, not waking the device and further awaiting user instructions.
In a preferred embodiment, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data includes one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
In a preferred embodiment, after the step of performing framing and windowing on the raw voice data and the simulated voice data to obtain the corresponding speech frames and extracting the acoustic feature information of the speech frames, the method further includes:
performing denoising on the acoustic feature information of the speech frames.
In a preferred embodiment, the step of recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data further includes:
performing denoising on the acoustic feature information corresponding to the live data.
The voice wake-up method of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
A voice wake-up system comprises:
a voice data simulation module, configured to record original audio data and acquire the environmental audio data corresponding to the scene in which the voice recording device is used, and to convert the original audio data into environment speech simulation data according to the environmental audio data;
a feature extraction module, configured to perform framing and windowing on the raw voice data and/or simulated voice data to obtain the corresponding speech frames and to extract the acoustic feature information of the speech frames;
a deep neural network module, configured to compute on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
a wake-up decision module, configured to record live voice data, extract the corresponding acoustic feature information, input it into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate the confidence of the recorded live voice data according to the posterior probability information, and compare the confidence against a preset threshold, waking the voice recording device when the confidence exceeds the threshold.
In a preferred embodiment, in recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data includes one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
In a preferred embodiment, the system further includes:
a denoising autoencoder module, configured to perform denoising on the acoustic feature information of the speech frames.
In a preferred embodiment, the wake-up decision module further includes:
a denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data.
The voice wake-up system of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
Brief description of the drawings
Fig. 1 is a flowchart of a voice wake-up method according to a preferred embodiment of the invention;
Fig. 2 is a module diagram of a voice wake-up system according to a preferred embodiment of the invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is further elaborated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it.
It should be noted that when an element is referred to as being "disposed on" another element, it can be directly on the other element, or intervening elements may be present. When an element is considered to be "connected to" another element, it can be directly connected to the other element, or intervening elements may be present at the same time. The terms "vertical", "horizontal", "left", "right", and similar expressions used herein are for illustrative purposes only and do not denote the only embodiment.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of the invention. The terms used in the specification of the invention are for the purpose of describing specific embodiments only and are not intended to limit the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
As shown in Fig. 1, a voice wake-up method according to a preferred embodiment of the invention includes the following steps:
S10: recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data.
In this step, the operator records original, clean audio data on the voice recording device, simulates environmental factors of the scene in which the voice recording device is located, such as noise, speaking rate, reverberation, pitch, and loudness, and thereby converts the original audio data into environment speech simulation data.
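The simulation of S10 can be pictured as simple waveform operations. The following is a minimal numpy sketch under assumed parameters (the SNR, speed factor, and gain values are illustrative, and reverberation simulation is omitted for brevity; the patent does not prescribe a specific augmentation algorithm):

```python
import numpy as np

def simulate_environment(speech, noise, snr_db=10.0, speed=1.0, gain=1.0):
    """Convert clean audio into 'environment speech simulation data':
    a speaking-rate change (naive linear resampling), additive noise
    at a target SNR, and a loudness change."""
    # Speaking-rate simulation: resample the waveform by linear interpolation.
    n_out = int(len(speech) / speed)
    idx = np.linspace(0, len(speech) - 1, n_out)
    sped = np.interp(idx, np.arange(len(speech)), speech)
    # Noise simulation: scale the noise track to the requested SNR.
    noise = np.resize(noise, len(sped))
    p_speech = np.mean(sped ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    noise_scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    # Loudness simulation: apply an overall gain.
    return gain * (sped + noise_scale * noise)
```

Applying several such transforms with different parameters to each clean recording yields the multiple simulated variants that the later training steps consume.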
S20: performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames.
The raw voice data and/or simulated voice data are split into their corresponding speech frames by framing and windowing, and feature extraction is then performed on each speech frame. In this embodiment, the speech feature may be filter-bank (fbank) features or any other speech feature; the invention places no limit on this.
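The framing, windowing, and fbank extraction of S20 might be sketched as follows. The frame length, hop, window type, and mel settings (25 ms frames, 10 ms hop at 16 kHz, Hamming window, 40 mel bands) are common defaults assumed here, not values fixed by the patent:

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a waveform into overlapping frames and apply a Hamming
    window (25 ms frames / 10 ms hop at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    frames = np.stack([x[i * hop: i * hop + frame_len] for i in range(n)])
    return frames * np.hamming(frame_len)

def log_fbank(frames, sr=16000, n_mels=40, n_fft=512):
    """Log mel filter-bank (fbank) energies for each windowed frame."""
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2
    # Build a triangular mel filter bank.
    mel = lambda f: 2595.0 * np.log10(1 + f / 700.0)
    imel = lambda m: 700.0 * (10 ** (m / 2595.0) - 1)
    pts = imel(np.linspace(mel(0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return np.log(spec @ fb.T + 1e-10)
```

The resulting per-frame feature matrix is what the denoising and classification steps below operate on.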
S30: performing denoising on the acoustic feature information of the speech frames.
In this step, denoising is applied to the acoustic feature information of the speech frames. Specifically, pairs of noise-simulated speech features and their corresponding raw speech features can be used to train a denoising autoencoder. This embodiment constructs the denoising autoencoder from a fully connected neural network; depending on the computing capability of the system, a network with 2-3 hidden layers of 256 or 512 nodes each is typically used, trained by stochastic gradient descent under the mean-square-error (MSE) minimization criterion.
S40: computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames.
First, for the acoustic feature information of the speech frames, a large-vocabulary continuous speech recognition system is used to generate forced-alignment information (at the phoneme or syllable level) for the original audio data and the corresponding environment speech simulation data, and the phonemes or syllables unrelated to the wake-up word are uniformly labeled as filler. In this embodiment, the acoustic features of the speech frames are fed into a convolutional neural network voice wake-up model, which is trained on a large amount of data by stochastic gradient descent under the cross-entropy criterion; the final optimization yields the deep neural network classification model corresponding to the acoustic feature information of the speech frames.
Besides the convolutional neural network, the deep network classification model may also be a fully connected neural network, a time-delay neural network, or the like.
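The classification model of S40 can be illustrated with a deliberately simplified stand-in: a single-hidden-layer softmax network trained with cross-entropy and gradient descent, in place of the convolutional network the text describes. The class labels, dimensions, and hyperparameters here are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

class KeywordClassifier:
    """Frame classifier over fbank features, mapping each frame to
    {wake-word units, filler}; a simplified stand-in for the
    convolutional wake-up model described in the text."""
    def __init__(self, dim, n_classes, hidden=128, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (dim, hidden)); self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, n_classes)); self.b2 = np.zeros(n_classes)

    def posteriors(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        return softmax(self.h @ self.W2 + self.b2)

    def sgd_step(self, X, y, lr=0.1):
        """One gradient step under the cross-entropy criterion."""
        p = self.posteriors(X)
        p[np.arange(len(y)), y] -= 1          # d(CE)/d(logits)
        p /= len(y)
        dh = (p @ self.W2.T) * (1 - self.h ** 2)
        self.W2 -= lr * self.h.T @ p
        self.b2 -= lr * p.sum(0)
        self.W1 -= lr * X.T @ dh
        self.b1 -= lr * dh.sum(0)
```

The per-frame posteriors this model emits are the inputs to the confidence computation of the later steps.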
S50: recording live voice data, extracting its corresponding acoustic feature information, and inputting that feature information into the deep neural network classification model to obtain the posterior probability information of the live voice data.
In this step, live voice data is recorded; it may be test speech or real speech data, and the invention places no limit on this. Acoustic feature information is extracted from the recorded live voice data and fed into the deep neural network classification model of the previous step, yielding the posterior probability information of the wake-up word class and non-wake-up word class contained in the acoustic feature information of the live voice data.
This step may also include denoising of the recorded live voice data, handled in the same way as the denoising of the acoustic feature information in step S30, which is not repeated here.
S60: calculating the confidence of the recorded live voice data in the deep network classification model according to the posterior probability information, comparing it against a preset threshold, and waking the voice recording device when the confidence exceeds the threshold.
In this step, the confidence of the recorded live voice data is derived from the distribution of the posterior probability information of the previous step and compared with the preset threshold to obtain a decision: when the wake-up word confidence exceeds the preset threshold, the voice device is woken; otherwise, when the confidence is below the threshold, the voice recording device is not woken and further user instructions are awaited.
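The confidence computation and threshold decision of S60 might be sketched as follows. Since the patent does not specify the confidence formula, the sliding-window posterior smoothing and geometric-mean combination used here are one common scheme, and the window length and threshold are assumed values:

```python
import numpy as np

def wake_decision(posteriors, keyword_ids, threshold=0.8, win=10):
    """Smooth per-frame posteriors over a sliding window, take each
    keyword unit's best smoothed score, combine the scores into one
    confidence (geometric mean), and compare against the preset
    threshold."""
    n = len(posteriors)
    conf = 1.0
    for k in keyword_ids:
        best = 0.0
        for t in range(n):
            lo = max(0, t - win + 1)
            best = max(best, posteriors[lo:t + 1, k].mean())
        conf *= best
    conf = conf ** (1.0 / len(keyword_ids))
    return conf > threshold, conf
```

A stream where every keyword unit fires strongly somewhere in the utterance yields a confidence near 1 and triggers the wake-up; background speech keeps all smoothed scores low and leaves the device asleep.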
The voice wake-up method of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
As shown in Fig. 2, another preferred embodiment of the invention discloses a voice wake-up system 100, which includes a voice data simulation module 110, a feature extraction module 120, a deep neural network module 130, and a wake-up decision module 140.
The voice data simulation module 110 is configured to record original audio data, acquire the environmental audio data corresponding to the scene in which the voice recording device is used, and convert the original audio data into environment speech simulation data according to the environmental audio data.
The operator can record original, clean audio data through the voice data simulation module 110, simulate environmental factors of the scene in which the voice recording device is located, such as noise, speaking rate, reverberation, pitch, and loudness, and thereby convert the original audio data into environment speech simulation data.
The feature extraction module 120 is configured to perform framing and windowing on the raw voice data and/or simulated voice data to obtain the corresponding speech frames, and to extract the acoustic feature information of the speech frames.
The feature extraction module 120 splits the raw voice data and/or simulated voice data into their corresponding speech frames by framing and windowing, and extracts the acoustic feature information of each frame. In this embodiment, the speech feature may be filter-bank (fbank) features or any other speech feature; the invention places no limit on this.
The system may also include a denoising autoencoder module 150, configured to perform denoising on the acoustic feature information of the speech frames.
The denoising autoencoder module denoises the acoustic feature information of the speech frames. Specifically, pairs of noise-simulated speech features and their corresponding raw speech features can be used to train a denoising autoencoder. This embodiment constructs the denoising autoencoder from a fully connected neural network; depending on the computing capability of the system, a network with 2-3 hidden layers of 256 or 512 nodes each is typically used, trained by stochastic gradient descent under the mean-square-error (MSE) minimization criterion.
The deep neural network module 130 is configured to compute on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames.
First, for the acoustic feature information of the speech frames, the deep neural network module 130 uses a large-vocabulary continuous speech recognition system to generate forced-alignment information (at the phoneme or syllable level) for the original audio data and the corresponding environment speech simulation data, and uniformly labels the phonemes or syllables unrelated to the wake-up word as filler. In this embodiment, the acoustic features of the speech frames are fed into a convolutional neural network voice wake-up model, which is trained on a large amount of data by stochastic gradient descent under the cross-entropy criterion; the final optimization yields the deep neural network classification model corresponding to the acoustic feature information of the speech frames.
The wake-up decision module 140 is configured to record live voice data, extract the corresponding acoustic feature information, input it into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate the confidence of the recorded live voice data according to the posterior probability information, and compare the confidence against a preset threshold; when the confidence exceeds the threshold, the voice recording device is woken, and when the confidence is below the threshold, the device is not woken and further user instructions are awaited.
The wake-up decision module 140 records live voice data, which may be test speech or real speech data; the invention places no limit on this. It extracts the acoustic feature information of the recorded live voice data and inputs it into the deep neural network classification model described above, obtaining the posterior probability information of the wake-up word class and non-wake-up word class contained in the acoustic feature information of the live voice data.
The wake-up decision module may also include a denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data. The recorded live voice data is denoised in the same way as in the denoising autoencoder module 150, which is not repeated here.
According to the distribution of the posterior probability information of the live voice data, the confidence of the recorded live voice data is obtained and compared with the preset threshold to produce a decision: when the wake-up word confidence exceeds the preset threshold, the voice device is woken; otherwise, the voice device does not respond.
The voice wake-up system of this embodiment effectively improves wake-up performance in noisy scenes, solves the wake-up system's robustness problem with respect to variations in a speaker's speaking rate and accent, and substantially improves the actual experience of using an intelligent voice device. Simulating speaking rate, pitch, volume, and similar variations on the original data effectively improves the wake-up system's adaptability to different speakers.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments merely express several implementations of the invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and these all fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (8)
1. A voice wake-up method, characterized by comprising the following steps:
recording original audio data and acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data;
performing framing and windowing on the raw voice data and/or simulated voice data to obtain the speech frames corresponding to the raw voice and/or simulated voice, and extracting the acoustic feature information of the speech frames;
computing on the acoustic feature information to obtain a deep neural network classification model containing at least a wake-up word class and a non-wake-up word class for the speech frames;
recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data;
calculating the confidence of the recorded live voice data according to the posterior probability information and comparing the confidence against a preset threshold; when the confidence exceeds the threshold, waking the voice recording device, and when the confidence is below the threshold, not waking the device and further awaiting user instructions.
2. The voice wake-up method according to claim 1, characterized in that, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is used, and converting the original audio data into environment speech simulation data according to the environmental audio data, the environment speech simulation data comprises one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation of the original audio data.
3. The voice wake-up method according to claim 1, characterized in that, after the step of performing framing and windowing on the raw voice data and the simulated voice data to obtain the corresponding speech frames and extracting the acoustic feature information of the speech frames, the method further comprises:
performing denoising on the acoustic feature information of the speech frames.
4. The voice wake-up method according to claim 1, characterized in that the step of recording live voice data, extracting the corresponding acoustic feature information, and inputting the acoustic feature information of the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data further comprises:
performing denoising on the acoustic feature information corresponding to the live data.
5. A voice wake-up system, comprising:
A voice data simulation module, configured to record original audio data, acquire the environmental audio data corresponding to the scene in which the voice recording device is applied, and convert the original audio data into environmental speech simulation data according to the environmental audio data;
A feature extraction module, configured to perform framing and windowing on the original voice data and/or the simulated voice data to obtain the voice frames corresponding to the original voice and/or the simulated voice, and to extract the acoustic feature information of the voice frames;
A deep neural network module, configured to perform computation on the acoustic feature information to obtain a deep neural network classification model of the voice frames comprising at least a wake-word class and a non-wake-word class;
A wake-up decision module, configured to record live voice data, extract the acoustic feature information corresponding to the live voice data, input the acoustic feature information corresponding to the live voice data into the deep neural network classification model to obtain the posterior probability information of the live voice data, calculate a confidence level of the recorded live voice data according to the posterior probability information, and compare the confidence level with a preset threshold; when the confidence level is greater than the preset threshold, the voice recording device is woken up; when the confidence level is less than the preset threshold, the voice recording device is not woken up and a user instruction is further obtained.
6. The voice wake-up system according to claim 5, wherein, in the step of recording the original audio data, acquiring the environmental audio data corresponding to the scene in which the voice recording device is applied, and converting the original audio data into environmental speech simulation data according to the environmental audio data, the environmental speech simulation data comprise one or more of noise simulation, speaking-rate simulation, reverberation simulation, and pitch and loudness simulation applied to the original audio data.
7. The voice wake-up system according to claim 5, further comprising:
A denoising auto-encoding module, configured to perform denoising on the acoustic feature information of the voice frames.
8. The voice wake-up system according to claim 5, wherein the wake-up decision module further comprises:
A denoising unit, configured to perform denoising on the acoustic feature information corresponding to the live data.
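Claims 7 and 8 add a denoising auto-encoding step on the acoustic features. A minimal forward-pass sketch follows; in practice the weights would be trained on noisy/clean feature pairs, and the bottleneck size and tanh nonlinearity here are assumptions.

```python
import numpy as np

def denoise_features(noisy, w_enc, b_enc, w_dec, b_dec):
    """Denoising auto-encoder forward pass: map noisy acoustic feature
    vectors through a bottleneck and reconstruct cleaned features."""
    hidden = np.tanh(noisy @ w_enc + b_enc)  # encode to bottleneck
    return hidden @ w_dec + b_dec            # decode cleaned features
```

The cleaned features would then be fed to the classification model in place of the raw ones.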
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811081600.XA CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811081600.XA CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109036412A true CN109036412A (en) | 2018-12-18 |
Family
ID=64622013
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811081600.XA Pending CN109036412A (en) | 2018-09-17 | 2018-09-17 | voice awakening method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109036412A (en) |
Citations (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103514879A (en) * | 2013-09-18 | 2014-01-15 | 广东欧珀移动通信有限公司 | Local voice recognition method based on BP neural network |
CN105096939A (en) * | 2015-07-08 | 2015-11-25 | 百度在线网络技术(北京)有限公司 | Voice wake-up method and device |
CN105448303A (en) * | 2015-11-27 | 2016-03-30 | 百度在线网络技术(北京)有限公司 | Voice signal processing method and apparatus |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
CN106057192A (en) * | 2016-07-07 | 2016-10-26 | Tcl集团股份有限公司 | Real-time voice conversion method and apparatus |
CN106127217A (en) * | 2015-05-07 | 2016-11-16 | 西门子保健有限责任公司 | The method and system that neutral net detects is goed deep into for anatomical object for approximation |
CN106297779A (en) * | 2016-07-28 | 2017-01-04 | 块互动(北京)科技有限公司 | A kind of background noise removing method based on positional information and device |
CN106328126A (en) * | 2016-10-20 | 2017-01-11 | 北京云知声信息技术有限公司 | Far-field speech recognition processing method and device |
CN106611599A (en) * | 2015-10-21 | 2017-05-03 | 展讯通信(上海)有限公司 | Voice recognition method and device based on artificial neural network and electronic equipment |
CN106611598A (en) * | 2016-12-28 | 2017-05-03 | 上海智臻智能网络科技股份有限公司 | VAD dynamic parameter adjusting method and device |
CN106683663A (en) * | 2015-11-06 | 2017-05-17 | 三星电子株式会社 | Neural network training apparatus and method, and speech recognition apparatus and method |
CN106782536A (en) * | 2016-12-26 | 2017-05-31 | 北京云知声信息技术有限公司 | A kind of voice awakening method and device |
CN106940998A (en) * | 2015-12-31 | 2017-07-11 | 阿里巴巴集团控股有限公司 | A kind of execution method and device of setting operation |
CN107123417A (en) * | 2017-05-16 | 2017-09-01 | 上海交通大学 | Optimization method and system are waken up based on the customized voice that distinctive is trained |
CN107134279A (en) * | 2017-06-30 | 2017-09-05 | 百度在线网络技术(北京)有限公司 | A kind of voice awakening method, device, terminal and storage medium |
CN107945788A (en) * | 2017-11-27 | 2018-04-20 | 桂林电子科技大学 | A kind of relevant Oral English Practice pronunciation error detection of text and quality score method |
CN108242234A (en) * | 2018-01-10 | 2018-07-03 | 腾讯科技(深圳)有限公司 | Speech recognition modeling generation method and its equipment, storage medium, electronic equipment |
CN108320733A (en) * | 2017-12-18 | 2018-07-24 | 上海科大讯飞信息科技有限公司 | Voice data processing method and device, storage medium, electronic equipment |
CN108335702A (en) * | 2018-02-01 | 2018-07-27 | 福州大学 | A kind of audio defeat method based on deep neural network |
CN108494710A (en) * | 2018-03-30 | 2018-09-04 | 中南民族大学 | Visible light communication MIMO anti-interference noise-reduction methods based on BP neural network |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109886386A (en) * | 2019-01-30 | 2019-06-14 | 北京声智科技有限公司 | Wake up the determination method and device of model |
CN109801629A (en) * | 2019-03-01 | 2019-05-24 | 珠海格力电器股份有限公司 | A kind of sound control method, device, storage medium and air-conditioning |
CN110223708A (en) * | 2019-05-07 | 2019-09-10 | 平安科技(深圳)有限公司 | Sound enhancement method and relevant device based on speech processes |
CN110223708B (en) * | 2019-05-07 | 2023-05-30 | 平安科技(深圳)有限公司 | Speech enhancement method based on speech processing and related equipment |
WO2020228815A1 (en) * | 2019-05-16 | 2020-11-19 | 华为技术有限公司 | Voice-based wakeup method and device |
CN110534102A (en) * | 2019-09-19 | 2019-12-03 | 北京声智科技有限公司 | A kind of voice awakening method, device, equipment and medium |
CN110534102B (en) * | 2019-09-19 | 2020-10-30 | 北京声智科技有限公司 | Voice wake-up method, device, equipment and medium |
CN110767231A (en) * | 2019-09-19 | 2020-02-07 | 平安科技(深圳)有限公司 | Voice control equipment awakening word identification method and device based on time delay neural network |
US11848008B2 (en) | 2019-11-14 | 2023-12-19 | Tencent Technology (Shenzhen) Company Limited | Artificial intelligence-based wakeup word detection method and apparatus, device, and medium |
CN110838289B (en) * | 2019-11-14 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Awakening word detection method, device, equipment and medium based on artificial intelligence |
WO2021093449A1 (en) * | 2019-11-14 | 2021-05-20 | 腾讯科技(深圳)有限公司 | Wakeup word detection method and apparatus employing artificial intelligence, device, and medium |
CN112825250A (en) * | 2019-11-20 | 2021-05-21 | 芋头科技(杭州)有限公司 | Voice wake-up method, apparatus, storage medium and program product |
CN111081217B (en) * | 2019-12-03 | 2021-06-04 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111081217A (en) * | 2019-12-03 | 2020-04-28 | 珠海格力电器股份有限公司 | Voice wake-up method and device, electronic equipment and storage medium |
CN111833869B (en) * | 2020-07-01 | 2022-02-11 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN111833869A (en) * | 2020-07-01 | 2020-10-27 | 中关村科学城城市大脑股份有限公司 | Voice interaction method and system applied to urban brain |
CN112992189B (en) * | 2021-01-29 | 2022-05-03 | 青岛海尔科技有限公司 | Voice audio detection method and device, storage medium and electronic device |
CN112992189A (en) * | 2021-01-29 | 2021-06-18 | 青岛海尔科技有限公司 | Voice audio detection method and device, storage medium and electronic device |
CN113593560A (en) * | 2021-07-29 | 2021-11-02 | 普强时代(珠海横琴)信息技术有限公司 | Customizable low-delay command word recognition method and device |
CN113593560B (en) * | 2021-07-29 | 2024-04-16 | 普强时代(珠海横琴)信息技术有限公司 | Customizable low-delay command word recognition method and device |
CN113782016A (en) * | 2021-08-06 | 2021-12-10 | 佛山市顺德区美的电子科技有限公司 | Wake-up processing method, device, equipment and computer storage medium |
CN113782016B (en) * | 2021-08-06 | 2023-05-05 | 佛山市顺德区美的电子科技有限公司 | Wakeup processing method, wakeup processing device, equipment and computer storage medium |
WO2023029615A1 (en) * | 2021-08-30 | 2023-03-09 | 华为技术有限公司 | Wake-on-voice method and apparatus, device, storage medium, and program product |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109036412A (en) | voice awakening method and system | |
CN106098059B (en) | Customizable voice awakening method and system | |
CN109326299B (en) | Speech enhancement method, device and storage medium based on full convolution neural network | |
CN106504768B (en) | Phone testing audio frequency classification method and device based on artificial intelligence | |
CN110970018B (en) | Speech recognition method and device | |
KR20060022156A (en) | Distributed speech recognition system and method | |
CN103377651B (en) | The automatic synthesizer of voice and method | |
CN103456305A (en) | Terminal and speech processing method based on multiple sound collecting units | |
CN104538043A (en) | Real-time emotion reminder for call | |
CN110930976A (en) | Voice generation method and device | |
CN110600008A (en) | Voice wake-up optimization method and system | |
CN105895082A (en) | Acoustic model training method and device as well as speech recognition method and device | |
CN112581938B (en) | Speech breakpoint detection method, device and equipment based on artificial intelligence | |
CN109410956A (en) | A kind of object identifying method of audio data, device, equipment and storage medium | |
CN112328994A (en) | Voiceprint data processing method and device, electronic equipment and storage medium | |
CN103811000A (en) | Voice recognition system and voice recognition method | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN105845131A (en) | Far-talking voice recognition method and device | |
CN116705071A (en) | Playback voice detection method based on data enhancement and pre-training model feature extraction | |
CN113099043A (en) | Customer service control method, apparatus and computer-readable storage medium | |
CN115762500A (en) | Voice processing method, device, equipment and storage medium | |
CN115472174A (en) | Sound noise reduction method and device, electronic equipment and storage medium | |
CN114333912A (en) | Voice activation detection method and device, electronic equipment and storage medium | |
CN103533193B (en) | Residual echo elimination method and device | |
CN117636909B (en) | Data processing method, device, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181218 |