CN110364143A - Voice wake-up method and apparatus, and intelligent electronic device - Google Patents
Voice wake-up method and apparatus, and intelligent electronic device
- Publication number
- CN110364143A CN110364143A CN201910747867.6A CN201910747867A CN110364143A CN 110364143 A CN110364143 A CN 110364143A CN 201910747867 A CN201910747867 A CN 201910747867A CN 110364143 A CN110364143 A CN 110364143A
- Authority
- CN
- China
- Prior art keywords
- voice
- keyword
- data
- wake
- frequency characteristics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications (all within Section G — Physics; G10 — Musical instruments; acoustics; G10L — Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding; G10L15/00 — Speech recognition)
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
- G10L15/142 — Hidden Markov Models [HMMs] (speech classification or search using statistical models)
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/223 — Execution procedure of a spoken command
- G10L2015/225 — Feedback of the input speech
Abstract
Disclosed are an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device. The voice wake-up method includes: obtaining an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when a voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a binary classification network.
Description
Technical field
The present disclosure relates to the field of speech recognition, and more particularly to an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device.
Background art
Voice wake-up refers to a user interacting with an electronic device by voice so that the device transitions from a dormant state to an activated state. At present, low-cost electronic devices often use relatively simple wake-up detection networks, whose false wake-up rates are comparatively high. On the other hand, achieving higher wake-up detection accuracy requires a complex wake-up detection network, which places higher demands on the computing capability of the electronic device and therefore cannot be widely deployed across all kinds of electronic devices.
Summary of the invention
Embodiments of the present disclosure provide an artificial-intelligence-based voice wake-up method and apparatus, and an intelligent electronic device.
An embodiment of the present disclosure provides an artificial-intelligence-based voice wake-up method, comprising: obtaining an audio feature set of voice data; detecting a voice wake-up keyword based on the audio feature set; and, when a voice wake-up keyword is detected, performing a wake-up decision on the audio feature set using a binary classification network.
An embodiment of the present disclosure further provides a voice wake-up apparatus, comprising: a voice data extraction module configured to obtain an audio feature set of voice data; a first processing module configured to detect a voice wake-up keyword based on the audio feature set; and a second processing module configured to, when a voice wake-up keyword is detected, perform a wake-up decision on the audio feature set using a binary classification network.
An embodiment of the present disclosure further provides a computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the above method.
An embodiment of the present disclosure further provides an intelligent electronic device, comprising: a voice acquisition unit configured to acquire voice data; a processor; and a memory storing computer instructions which, when executed by the processor, implement the above method.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present disclosure more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. The drawings described below show only exemplary embodiments of the present disclosure.
Fig. 1 is a schematic diagram showing a voice wake-up scene according to an embodiment of the present disclosure.
Fig. 2 is a flowchart showing a voice wake-up method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram showing the voice wake-up method according to an embodiment of the present disclosure.
Fig. 4 is another schematic diagram showing the voice wake-up method according to an embodiment of the present disclosure.
Fig. 5 is a schematic diagram showing a voice wake-up apparatus according to an embodiment of the present disclosure.
Fig. 6 is a structural diagram showing an intelligent electronic device according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram showing a terminal dual-model system for voice wake-up.
Detailed description of embodiments
To make the objectives, technical solutions, and advantages of the present disclosure more apparent, example embodiments according to the present disclosure are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure, and it should be understood that the disclosure is not limited by the example embodiments described herein.
In this specification and the drawings, substantially identical or similar steps and elements are denoted by the same or similar reference numerals, and repeated descriptions of these steps and elements are omitted. Meanwhile, in the description of the present disclosure, terms such as "first" and "second" are used only to distinguish the descriptions and are not to be understood as indicating or implying relative importance or order.
For ease of description, concepts related to the present disclosure are introduced below.
Artificial intelligence (AI) is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive discipline of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
Artificial intelligence technology is an interdisciplinary field involving a wide range of areas, covering both hardware-level and software-level techniques. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and electromechanical integration. AI software technologies mainly include several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Key speech technologies include automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising interaction modes. At present, automatic speech recognition is widely applied in many fields. Voice wake-up detection, as a branch of automatic speech recognition, is likewise widely used in various intelligent electronic devices and serves as a common way of waking up these devices.
Fig. 1 is a schematic diagram showing a scene 100 of voice wake-up detection according to an embodiment of the present disclosure.
Referring to Fig. 1, in the scene 100, user A and user B interact with a smart device 101 through spoken dialogue with it.
The smart device 101 may be any smart device, for example, an intelligent electronic device (e.g., a smart speaker, a smart TV, a smart gateway, etc.), a smartphone, a smart in-vehicle device, and so on. The smart device 101 may also be a voice assistant apparatus or voice assistant software carried on such a device. When the smart device 101 recognizes that a user has said the correct voice wake-up keyword, it can perform various operations according to the content of the keyword. For example, when a user says the correct wake-up keyword (e.g., user A in Fig. 1 says "Doraemon"), the smart device 101 recognizes that the user has said the correct voice wake-up keyword and is activated from the dormant state into the operating state. When a user says a wrong wake-up keyword (e.g., user B in Fig. 1 says "goodbye"), the smart device 101 remains in the dormant state.
Realizing the above scene 100 usually requires voice wake-up detection. Voice wake-up detection (also called keyword spotting, KWS) refers to detecting whether a segment of voice data contains a specific speech fragment. In general, this specific speech fragment contains the voice wake-up keyword, such as "Doraemon" in Fig. 1.
Various systems can implement voice wake-up, including the Deep KWS system, the keyword/filler hidden Markov model system, the terminal dual-model system, and the cloud two-stage model system. Each has its own shortcomings in practical industrial applications.
For example, the Deep KWS system is a single-model structure that uses a deep neural network to obtain balanced wake-up performance. Because the system uses only a single model structure, its recognition performance is insufficient in complex application scenarios such as far-field or noisy environments.
Fig. 7 shows a schematic diagram of a terminal dual-model system 700 for voice wake-up. The terminal dual-model system shown in Fig. 7 uses two complex neural networks and a large amount of computation to obtain a relatively accurate wake-up result. It includes a low-computation module 701 and an accurate-computation module 702. The low-computation module 701 includes an MFCC feature computation module, a feature cache module, a small deep neural network (small DNN) module, and a first hidden Markov scoring (first HMM score) module. The small DNN module preliminarily judges whether the input voice is related to the voice wake-up keyword and outputs a first association probability; the first HMM scoring module determines a first confidence level from the first association probability. The accurate-computation module 702 includes a large deep neural network (large DNN) module and a second hidden Markov scoring (second HMM score) module. After the low-computation module 701 detects that the user has said the wake-up keyword, the feature data in the feature cache module are input to the large DNN module in the accurate-computation module 702. The large DNN module judges again whether the input voice is related to the wake-up keyword and outputs a second association probability to the second HMM scoring module to obtain a second confidence level. Because the terminal dual-model system 700 uses two cascaded complex neural networks on the terminal, and the second-stage network requires even more computation than the first-stage network, it needs more computing resources and places higher demands on the intelligent electronic device.
The cloud two-stage model system likewise uses the two neural networks described above for the wake-up decision but, to reduce the computational load on the terminal side, places the complex second-stage neural network in the cloud. However, because this system must verify with the cloud over a network, it suffers from the technical problem of long response latency.
The present disclosure proposes an improved artificial-intelligence-based voice wake-up method that uses a binary classification network as the second-stage neural network, thereby reducing the amount of computation, shortening the latency, and improving the accuracy of the smart device's response.
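The two-stage gating just described — a cheap first-stage keyword detector whose positive result activates a small second-stage classifier — can be sketched as follows. This is a minimal illustration of the control flow only; the two toy stand-in models are invented for the example and are not the networks of this disclosure.

```python
# Minimal sketch of the two-stage wake-up pipeline: stage 1 (keyword
# detection network) gates stage 2 (binary classification network).

def wake_up_decision(feature_set, detect_keyword, binary_classifier,
                     threshold=0.5):
    """Return True only if stage 1 fires AND stage 2 confirms."""
    if not detect_keyword(feature_set):       # cheap first-stage check
        return False                          # stage 2 never runs
    score = binary_classifier(feature_set)    # small second-stage net
    return score > threshold

# Toy stand-ins for the two networks (illustrative only):
detector = lambda feats: sum(feats) > 3.0         # pretend KWS trigger
classifier = lambda feats: min(1.0, max(feats))   # pretend confidence

print(wake_up_decision([0.1, 0.2], detector, classifier))       # False
print(wake_up_decision([2.0, 2.0, 0.9], detector, classifier))  # True
```

Because the second stage only runs after a first-stage detection, the average computation per second of audio stays close to that of the cheap detector alone.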
Fig. 2 is a flowchart showing a voice wake-up method 200 according to an embodiment of the present disclosure.
The voice wake-up method 200 according to an embodiment of the present disclosure may be applied in any smart device, or may be executed in the cloud with the decision result returned to the device to be woken. In the following, the smart device 101 in Fig. 1 is taken as an example for explanation.
First, in step S201, an audio feature set of voice data is obtained.
Specifically, the voice data may include sound captured in various manners and stored in the form of a digital file, for example, voice data periodically captured by the microphone of the smart device 101. The voice data may be buffered in the memory of the smart device 101 for subsequent analysis, and may be encoded or stored in formats such as .mp3, .wav, .voc, and .au. The present disclosure places no restriction on the format of the voice data.
Each element in the audio feature set is audio feature data that can be extracted from the voice data. To characterize and recognize the voice data, it is usually necessary to analyze data such as its frequency, volume, mood, pitch, and energy. These data may be referred to as the "audio feature data" of the voice data.
To facilitate the analysis of the voice data, the audio feature data may be obtained using various speech feature extraction models, including but not limited to FBANK (also known as FilterBank) and MFCC. Audio feature data extracted by the FBANK speech feature extraction model are called FBANK speech feature data. The present disclosure is explained using FBANK speech feature data as an example, but is not limited thereto. The FBANK model extracts audio features in a manner similar to how the human ear processes the sound it hears: by applying operations such as the Fourier transform, energy spectrum computation, and mel filtering to the framed voice data, it obtains an array (also called an FBank feature vector) that can characterize each frame of voice data. This array is the FBANK audio feature data.
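The per-frame FBANK computation named above (Fourier transform, energy spectrum, mel filtering) can be sketched as follows. The triangular mel-filter construction is a common textbook form, and the sample rate, FFT size, and filter count are illustrative assumptions, not values taken from this disclosure.

```python
# Hedged sketch of FBANK feature extraction for one audio frame.
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def fbank_frame(frame, sample_rate=16000, n_fft=512, n_mels=13):
    """Compute log mel-filterbank energies for a single frame."""
    windowed = frame * np.hamming(len(frame))      # taper frame edges
    spectrum = np.fft.rfft(windowed, n_fft)        # Fourier transform
    power = (np.abs(spectrum) ** 2) / n_fft        # energy spectrum
    # Triangular mel filters spaced evenly on the mel scale:
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    filters = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            filters[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            filters[m - 1, k] = (r - k) / max(r - c, 1)
    return np.log(filters @ power + 1e-10)         # mel filtering + log

# One 10 ms frame at 16 kHz is 160 samples:
frame = np.sin(2 * np.pi * 440 * np.arange(160) / 16000)
feats = fbank_frame(frame)
print(feats.shape)   # (13,) — an L-dimensional array per frame, here L = 13
```

The resulting 13-dimensional vector corresponds to one element of the audio feature set; production systems typically use library implementations rather than hand-rolled filterbanks.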
In step S202, a voice wake-up keyword is detected based on the audio feature set.
Specifically, by further analyzing each audio feature datum in the audio feature set, it can be detected whether the voice data contains the voice wake-up keyword. The wake-up keyword may be any keyword preset by the user, or a default keyword of the smart device 101, such as "Doraemon" in Fig. 1. The speech feature data of voice data containing the wake-up keyword can be determined in advance; the audio feature data in the audio feature set are then compared with these predetermined speech feature data to determine whether the audio feature set matches the wake-up keyword. For example, the FBANK speech feature data of the phrase "Doraemon" can be predetermined and then compared with the audio feature set obtained in step S201 to determine whether the wake-up keyword is detected.
The above step of detecting the wake-up keyword further includes using a keyword detection network to determine whether the audio feature set matches the wake-up keyword. The keyword detection network may be any of various model structures, such as a DNN, CNN, or LSTM. It may employ an acoustic model that uses phoneme labels to determine whether the audio feature set matches the wake-up keyword. A phoneme is the smallest speech unit divided according to the natural attributes of speech, determined by the articulatory action within a syllable. For example, the Chinese syllable ā contains one phoneme, while ài ("love") contains two. Since the wake-up keyword can be divided into multiple phonemes, multiple phoneme labels can be used to represent the speech features of the wake-up keyword. The keyword detection network computes, in sequence, the association probability of each audio feature datum in the audio feature set with respect to the phoneme labels of the wake-up keyword. These association probabilities are aggregated to obtain a confidence level that the voice data contains the wake-up keyword; a confidence level above a predetermined threshold indicates that the keyword is detected.
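The aggregation step above — combining per-frame phoneme association probabilities into one keyword confidence — can be sketched as follows. The smoothing-then-max-pooling recipe here follows the common Deep KWS approach and is an assumption, not the exact formula of this disclosure.

```python
# Sketch: per-frame phoneme posteriors -> smoothed scores -> one confidence.
import numpy as np

def keyword_confidence(posteriors, smooth_win=3):
    """posteriors: (n_frames, n_phonemes) association probabilities,
    one column per phoneme label of the wake-up keyword."""
    n, p = posteriors.shape
    smoothed = np.empty_like(posteriors)
    for t in range(n):                         # moving-average smoothing
        lo = max(0, t - smooth_win + 1)
        smoothed[t] = posteriors[lo:t + 1].mean(axis=0)
    # Confidence: geometric mean of the best smoothed score per phoneme.
    best_per_phoneme = smoothed.max(axis=0)
    return float(best_per_phoneme.prod() ** (1.0 / p))

# Toy example: 3 phoneme labels firing one after another over 6 frames.
post = np.array([[0.9, 0.1, 0.0],
                 [0.8, 0.2, 0.1],
                 [0.1, 0.9, 0.1],
                 [0.0, 0.8, 0.2],
                 [0.1, 0.1, 0.9],
                 [0.0, 0.0, 0.8]])
conf = keyword_confidence(post)
print(conf > 0.5)   # True — treat as "keyword detected" above a threshold
```

Each phoneme label peaking in turn drives the confidence up; a keyword whose phonemes never all fire yields a low geometric mean and is rejected by the threshold.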
Of course, the keyword detection network may also be any other model capable of recognizing the voice wake-up keyword, for example, a hidden Markov model (HMM) or a Gaussian mixture model (GMM).
In step S203, when the voice wake-up keyword is detected, a wake-up decision is performed on the audio feature set using a binary classification network.
Specifically, the binary classification network (which may also be called a binary classification model) is a neural network that divides its input into two classes (that is, its output is 0 or 1). When the wake-up keyword is detected, the binary classification network is activated to make a further decision on the audio feature set. The model parameter count of the binary classification network is much smaller than that of the keyword detection network, so the amount of computation in the system can be reduced. The wake-up decision of the binary classification network on the audio feature set may be executed in the cloud or on the terminal; the present disclosure places no limitation on this.
More specifically, the binary classification network may include multiple layers: an input layer, at least one hidden layer, and an output layer. Each hidden layer contains multiple nodes. A node may be a neuron (cell) or a perceptron, and each node may have multiple inputs. The output layer contains no more than two nodes. Each node may have a different weight and bias for each of its inputs, and the values of the weights and biases are trained from sample data.
The binary classification network may be a fully connected neural network, meaning that every node in each pair of adjacent layers of the network is connected: each node in the input layer is connected to each node in the hidden layer nearest the input layer, the nodes in adjacent hidden layers are all interconnected, and each node in the hidden layer nearest the output layer is connected to the two nodes of the output layer. Using a fully connected neural network makes it possible to analyze the input audio feature data from more angles, yielding a more accurate decision result.
Specifically, multiple audio feature data in the audio feature set may be synthesized into representative audio feature data, and the wake-up decision is performed on the representative audio feature data using the fully connected neural network. "Representative audio feature data" denotes audio feature data that can characterize/represent the audio feature set. For example, the representative audio feature data may be formed by selecting a predetermined number of audio feature data from the audio feature set and splicing them together in chronological order, or may be audio feature data extracted after applying other secondary processing to each element in the audio feature set. The present disclosure does not limit the specific form of the representative audio feature data, as long as it can characterize the audio feature set.
The representative audio feature data is input to the input layer of the fully connected neural network and, via the at least one hidden layer, the output layer can output "0", indicating that the smart device should not be woken, or "1", indicating that it should be woken. The output layer may also output a real number greater than or equal to 0 and less than 1; a value greater than a predetermined threshold then indicates that the intelligent electronic device should be woken. The binary classification network thereby completes the wake-up decision performed on the audio feature set.
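A minimal numpy sketch of such a fully connected binary classifier follows: stacked frame features in, a single sigmoid score out, thresholded into a wake / no-wake decision. The layer sizes and random weights are illustrative assumptions; a deployed network would be trained on labeled wake-up samples.

```python
# Sketch: fully connected binary classification network for wake-up decision.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    return rng.normal(0, 0.1, (n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BinaryWakeNet:
    """Input -> hidden (ReLU) -> 1 output unit (sigmoid)."""
    def __init__(self, n_in, n_hidden=16):
        self.w1, self.b1 = init_layer(n_in, n_hidden)
        self.w2, self.b2 = init_layer(n_hidden, 1)

    def score(self, x):
        h = np.maximum(0.0, x @ self.w1 + self.b1)    # fully connected + ReLU
        return float(sigmoid(h @ self.w2 + self.b2))  # real number in (0, 1)

    def decide(self, x, threshold=0.5):
        return self.score(x) > threshold              # wake / stay dormant

# "Representative" input: M = 10 frames of L = 13 features, concatenated.
features = rng.normal(size=10 * 13)
net = BinaryWakeNet(n_in=10 * 13)
s = net.score(features)
print(0.0 < s < 1.0)   # True — the sigmoid output always lies in (0, 1)
```

With a 130-input, 16-hidden-unit network, a forward pass costs roughly two small matrix-vector products, which is the parameter-count advantage over the keyword detection network that the text describes.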
When the voice data is judged as a wake-up, the smart device 101 can be woken. For example, when the binary classification network is located in the cloud, the cloud server can send a signal to the smart device 101 over a wired and/or wireless network to trigger the smart device 101 to transition from the dormant state to the working state. When the binary classification network is located on the smart device 101, the wake-up decision can directly activate the device to transition from the dormant state to the working state. When the voice data is judged as not a wake-up, the smart device 101 can remain dormant or do nothing.
Thus, the voice wake-up method 200 according to the embodiment of the present disclosure uses a binary classification network with a relatively small parameter count to effectively suppress most false wake-ups, significantly reducing the amount of computation while shortening latency and improving the accuracy of the smart device's response. Compared with common voice wake-up techniques that use only a single complex neural network model, or multiple complex neural network models of the same architecture, the voice wake-up method 200 can reach the level of industrial application in complex scenarios such as far-field and high-noise environments, waking the device correctly at low latency and improving the overall usability of the smart device.
Fig. 3 is a schematic diagram showing the voice wake-up method 200 according to an embodiment of the present disclosure.
As shown in Fig. 3, obtaining the audio feature set 302 of the voice data 301 may include obtaining the audio feature data of each frame of the voice data.
Specifically, referring to Fig. 3, the voice data 301 can be divided into multiple frames of a certain time length. In general, voice data containing a complete wake-up keyword is 2 to 5 seconds long. The voice data 301 may be divided into frames of, for example, 10 milliseconds each. To come closer to the way the human ear processes voice data, adjacent frames may overlap: for example, the first frame may be the 0th to 10th millisecond of the voice data, and the second frame the 8th to 18th millisecond.
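The overlapping framing just described (10 ms frames starting 8 ms apart, giving a 2 ms overlap) can be sketched as follows; the 16 kHz sample rate, which makes these durations 160 and 128 samples, is an illustrative assumption.

```python
# Sketch: slice a 1-D signal into overlapping frames (drop the ragged tail).
import numpy as np

def split_frames(signal, frame_len=160, hop=128):
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n)])

# 1 second of audio at 16 kHz:
signal = np.arange(16000, dtype=float)
frames = split_frames(signal)
print(frames.shape)          # (124, 160)
print(frames[1][0])          # 128.0 — frame 2 starts at the 8 ms mark
```

Each row of the result is then passed through the per-frame feature extraction to build the audio feature set.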
Each frame of voice data can then be processed to obtain the audio feature data of that frame (step 1 in Fig. 3). For example, the FBANK model described above can be used to obtain the FBANK audio feature data of each frame. Each frame's audio feature data may be an array of dimension L, where L is greater than or equal to 1; optionally, L equals 13. The audio feature set 302 may include the audio feature data of multiple consecutive frames.
With continued reference to Fig. 3, each obtained frame of audio feature data can be cached according to a predetermined caching rule (step 2 in Fig. 3). For example, each frame of audio feature data can be sequentially written into the cache 303. The predetermined caching rule includes, but is not limited to: caching the audio feature data of a predetermined number of consecutive frames according to a first-in-first-out rule; or caching the audio feature data of a predetermined number of consecutive frames after a predetermined phoneme label is detected. Optionally, the size of the cache 303 can be just large enough to cover what is needed to recognize the voice wake-up keyword. For example, assuming that recognizing the voice wake-up keyword "Doraemon" requires the audio feature data of about M frames, the size of the cache 303 can be M*L bits.
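The first-in-first-out caching rule for cache 303 can be sketched with a bounded queue; the values of M and L below are illustrative:

```python
from collections import deque

M, L = 100, 13  # illustrative: cache holds M frames of L-dimensional features

# A deque with maxlen implements the first-in-first-out rule: once M frames
# are cached, appending a new frame silently evicts the oldest one.
feature_cache = deque(maxlen=M)

for frame_index in range(150):                 # 150 incoming frames
    frame_features = [float(frame_index)] * L  # stand-in for FBANK features
    feature_cache.append(frame_features)

# The cache now holds exactly the most recent M frames (frames 50..149).
```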
The cache 303 can sequentially input the first through N-th frames of audio feature data into the keyword detection network 304 (step 3 in Fig. 3). To obtain more accurate results, the keyword detection network 304 can be a complex deep neural network. Specifically, as shown in Fig. 3, the keyword detection network 304 may include one or more hidden layers. Each hidden layer includes multiple neurons (cells), and each neuron can have multiple inputs. For example, the input to a neuron in the hidden layer closest to the input layer can be any dimension of the L-dimensional audio feature data. Each neuron has a weight and a bias for each input, and the values of the weights and biases are obtained by training on a large amount of sample data. The keyword detection network 304 in Fig. 3 is only an example and can also have other structures; the disclosure does not limit the structure of the keyword detection network 304, the number of nodes in each layer, or the connections between nodes.
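The per-frame computation of such a network (weighted sums plus biases through hidden layers, with a softmax over phoneme labels at the output) can be sketched as below. The layer sizes and random weights are illustrative assumptions; in the method above, the weights and biases come from training on a large sample set:

```python
import math
import random

random.seed(0)

def layer(n_in, n_out):
    # Each neuron has one weight per input plus a bias (random stand-ins here).
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    # ReLU hidden layers, softmax output over the phoneme labels.
    for depth, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if depth < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    exps = [math.exp(v) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

L, HIDDEN, N_LABELS = 13, 32, 8        # illustrative sizes
net = [layer(L, HIDDEN), layer(HIDDEN, HIDDEN), layer(HIDDEN, N_LABELS)]

frame = [0.5] * L                      # one frame of L-dimensional features
posteriors = forward(frame, net)       # per-label association probabilities for this frame
```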
The above keyword detection network 304 compares each frame of audio feature data in the cache with the phoneme labels of the voice wake-up keyword to determine the association probability between that frame of audio feature data and each phoneme label. The keyword detection network 304 can process the audio feature data of one frame at a time, or of multiple frames at a time. Taking one frame at a time as an example, the keyword detection network 304 can calculate the association probability P_ij between the j-th frame of audio feature data and the i-th phoneme label, where i and j are integers greater than or equal to 0. For example, when processing the first frame of audio feature data of the voice wake-up keyword "Doraemon", the keyword detection network 304 can compare the first phoneme label "x" of the voice wake-up keyword with the first frame of audio feature data and output the probability P_11 that the first phoneme label "x" is associated with the first frame of audio feature data.
Since the association probability P_ij usually contains noise, it can be smoothed with a smoothing window before the confidence of the voice wake-up keyword is calculated. For example, the following formula (1) can be used to process the association probability P_ij and obtain the smoothed association probability P'_ij:

P'_ij = (1 / (j - h_smooth + 1)) * Σ_{k = h_smooth}^{j} P_ik    (1)

In formula (1), k takes the values between h_smooth and j, and h_smooth denotes the index/frame number of the first frame in the smoothing window. h_smooth can be calculated with the following formula (2):
h_smooth = max{1, j - w_smooth + 1}    (2)
The above w_smooth refers to the size of the smoothing window. For example, when the size of the smoothing window is 6 frames, j = 10, and i = 9, the smoothed association probability P'_ij is the average of the probabilities that the audio feature data of the 5th through 10th frames are each associated with the 9th phoneme label; in this case h_smooth equals 5. The smoothing reduces the noise in the per-frame keyword probabilities across consecutive frames, making the confidence more accurate.
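Formulas (1) and (2) can be sketched directly; the example below reproduces the numbers from the text (a 6-frame window, j = 10, i = 9), assuming frames are numbered from 1 as formula (2) implies, with illustrative posterior values:

```python
def smoothed_posterior(P, i, j, w_smooth):
    # Formula (2): first frame inside the smoothing window (frames numbered from 1).
    h_smooth = max(1, j - w_smooth + 1)
    # Formula (1): average of P[i][k] for k = h_smooth .. j.
    window = [P[i][k] for k in range(h_smooth, j + 1)]
    return sum(window) / (j - h_smooth + 1)

# Illustrative posteriors: P[i][k] = probability that frame k matches phoneme
# label i, stored in a dict so frames can be numbered from 1 as in the text.
P = {9: {k: 0.1 * k for k in range(1, 11)}}

# 6-frame window, j = 10, i = 9: the average of P[9][5..10], with h_smooth = 5.
p_smooth = smoothed_posterior(P, i=9, j=10, w_smooth=6)
```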
Then, the keyword detection network 304 can sequentially input the smoothed association probabilities P'_ij one by one into the confidence calculation window 305 (step 4 in Fig. 3), without having to compute all the association probabilities P'_ij at once. The keyword detection network 304 can calculate the confidence that the voice wake-up word is detected at the j-th frame. Assume the window size of the confidence calculation window 305 is w_max, where w_max is greater than 1. Specifically, the confidence calculation window 305 can use the following formula (3):

confidence = ( Π_{i=1}^{n} max_{h_max ≤ m ≤ j} P'_im )^{1/n}    (3)

to calculate the confidence that the voice wake-up keyword is detected in the audio feature set 302. In formula (3), n denotes the index of the phoneme label currently being calculated. For example, assuming the voice wake-up keyword has 30 phoneme labels and the 25th phoneme label is currently being processed, then n equals 25. m takes the values between h_max and j. h_max denotes the index/frame number of the first frame in the confidence calculation window and can be obtained with the following formula (4):

h_max = max{1, j - w_max + 1}    (4)
According to formulas (1) through (4), the confidence output by the confidence calculation window 305 is usually small for the first few frames, because at that point the data in the audio feature set has not yet been compared with most of the phoneme labels of the voice wake-up keyword. As more audio feature data is compared, the confidence keeps changing. If the voice data contains the voice wake-up keyword, the confidence output by the confidence calculation window 305 tends to increase as more audio feature data is compared. When the confidence reaches a certain threshold, it is determined that the voice wake-up keyword has been detected. For example, in formula (3), assuming there are 30 phoneme labels in total, the confidence may already exceed the threshold when n = 25. In that case, the association probabilities between the 26th through 30th phoneme labels and the audio feature data need not be calculated, and it can be directly judged that the voice wake-up keyword has been detected. If the voice data does not contain the voice wake-up keyword, the confidence output by the confidence calculation window 305 will never reach the threshold, and it will be determined that the voice wake-up keyword has not been detected.
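Formulas (3) and (4), including the early-exit behavior described above (stop once the confidence crosses the threshold before all phoneme labels are processed), can be sketched as follows; the posterior values, window size, and threshold are illustrative assumptions:

```python
def confidence(P_smooth, n, j, w_max):
    # Formula (4): index of the first frame inside the confidence window.
    h_max = max(1, j - w_max + 1)
    # Formula (3): geometric mean, over phoneme labels 1..n, of each label's
    # maximum smoothed posterior within the window.
    product = 1.0
    for i in range(1, n + 1):
        product *= max(P_smooth[i][m] for m in range(h_max, j + 1))
    return product ** (1.0 / n)

def detect(P_smooth, n_labels, j, w_max, threshold):
    # Early exit: once the running confidence crosses the threshold, report a
    # detection without processing the remaining phoneme labels.
    for n in range(1, n_labels + 1):
        if confidence(P_smooth, n, j, w_max) >= threshold:
            return True, n
    return False, n_labels

# Illustrative smoothed posteriors over frames 1..40 for 30 labels: the first
# 10 labels match weakly, the rest strongly; P_noise matches nothing.
P_keyword = {i: {m: (0.3 if i <= 10 else 0.95) for m in range(1, 41)}
             for i in range(1, 31)}
P_noise = {i: {m: 0.2 for m in range(1, 41)} for i in range(1, 31)}

hit, n_used = detect(P_keyword, n_labels=30, j=40, w_max=35, threshold=0.5)
miss, _ = detect(P_noise, n_labels=30, j=40, w_max=35, threshold=0.5)
```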
The above keyword detection network 304 and confidence calculation window 305 can run in parallel to reduce latency.
As described above, after the voice wake-up keyword is detected, the binary classification network 306 can be activated to make a wake-up decision on the audio feature set 302. Specifically, when the confidence is greater than the threshold, the confidence calculation window 305 can send a specific signal to the cache 303 (step 5 in Fig. 3). The cache 303 then sends the audio feature data it has cached to the binary classification network 306 (step 6 in Fig. 3). Specifically, assume that at the j-th frame, the confidence output by the confidence calculation window 305 is greater than the threshold. At this point, the audio feature data of the (j-p)-th through (j+p)-th frames in the cache (p being a natural number greater than or equal to 0) can be combined into representative audio feature data and input into the binary classification network 306. Of course, all the audio feature data in the cache can also be input into the binary classification network 306. The binary classification network 306 can then judge, according to the method described above, whether to wake up the smart device 101.
The binary classification network 306 in Fig. 3 is shown as a fully connected network only as an example; those skilled in the art should understand that the binary classification network 306 can also have other structures, for example including multiple hidden layers, and the disclosure places no restriction on its structure.
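The decision step can be sketched as follows: the cached frames around frame j are averaged into representative audio feature data (one plausible way to combine them) and passed through a single fully connected sigmoid unit whose output is compared against 0.5. The averaging, layer size, weights, and cut-off are all illustrative assumptions, not the disclosure's fixed design:

```python
import math

def representative_feature(cached_frames, j, p):
    # Combine frames j-p .. j+p into one representative feature vector
    # by element-wise averaging.
    window = cached_frames[max(0, j - p): j + p + 1]
    return [sum(col) / len(window) for col in zip(*window)]

def wake_decision(feature, weights, bias):
    # One fully connected sigmoid unit: wake if the score exceeds 0.5.
    z = sum(w * x for w, x in zip(weights, feature)) + bias
    return 1.0 / (1.0 + math.exp(-z)) > 0.5

L = 13
cached_frames = [[0.1 * f] * L for f in range(20)]      # stand-in cached features
rep = representative_feature(cached_frames, j=10, p=3)  # frames 7..13

weights, bias = [1.0] * L, -10.0                        # illustrative parameters
decision = wake_decision(rep, weights, bias)
```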
Fig. 4 is another schematic diagram showing the voice wake-up method 200 according to an embodiment of the present disclosure.
Referring to Fig. 4, the voice wake-up method 200 of the disclosure can be implemented by two modules, namely a high wake-up rate module 401 and a low false wake-up rate module 402.
The high wake-up rate module 401 includes an FBANK feature calculation module, a feature cache module, a keyword detection network, and a posterior processing module. The FBANK feature calculation module calculates the FBANK features of the audio input, for example implementing step 1 in Fig. 3. The feature cache module stores the calculated FBANK features, for example implementing step 2 in Fig. 3. The keyword detection network detects the voice wake-up keyword and can be similar to the keyword detection network 304 in Fig. 3. The posterior processing module further processes the association probabilities output by the keyword detection network (since these association probabilities are calculated under given conditions/inputs, they are also called posterior probabilities) and can be similar to the confidence calculation window 305 in Fig. 3.
Specifically, the high wake-up rate module 401 uses the keyword detection network to detect the wake-up word and can achieve a relatively high wake-up rate. To this end, the audio data samples used to train the keyword detection network can be clear, with relatively little noise. Assume the keyword detection network is trained with a first voice data sample set, and the signal-to-noise ratios of the voice data samples in the first voice data sample set are averaged; the result can be called the first average signal-to-noise ratio, which can be relatively high. For example, the first voice data sample set may include samples A in which a user clearly speaks the voice wake-up keyword in a quiet environment, such as voice data samples of a user saying "Doraemon". To clearly distinguish them from samples A, the first voice data sample set may also include samples B in which a user clearly speaks randomly chosen phrases that are not the voice wake-up keyword in a quiet environment, for example voice data samples of a user saying "goodbye", "hello", or "the weather is very good".
A keyword detection network trained with the above first voice data sample set may exhibit a high false wake-up rate when processing input data with a low signal-to-noise ratio. The "false wake-up rate" refers to the probability of identifying voice data that does not contain the voice wake-up keyword as containing it. For example, when processing voice data with heavy music or television noise, the keyword detection network may wrongly identify voice data that does not contain the voice wake-up keyword as containing it, for example mistaking voice data containing "goodbye ding-dong" for voice data containing "Doraemon". Therefore, the low false wake-up rate module 402 can be used to make a wake-up decision on that voice data, thereby reducing the false wake-up rate of the voice wake-up method 200.
The low false wake-up rate module 402 includes a binary classification network and a threshold judgment module. The binary classification network is similar to the binary classification network 306 in Fig. 3. The threshold judgment module determines, based on the output of the binary classification network, whether to wake up the intelligent electronic device.
The low false wake-up rate module 402 uses the binary classification network to make the wake-up decision on the above voice data, thereby achieving a low false wake-up rate. The binary classification network is trained with a second voice data sample set, which has a second average signal-to-noise ratio. The second average signal-to-noise ratio is lower than the first average signal-to-noise ratio. For example, the data samples in the second voice data sample set can be voice data samples synthesized from the samples in the first voice data sample set and various noise data. The noise data can be strong noise data, or real music, television background sound, and the like. The second voice data sample set may also include samples A' in which a user speaks the voice wake-up keyword in a noisy environment, and of course samples B' in which a user speaks random non-wake-up phrases in a noisy environment.
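Synthesizing second-set samples by mixing clean first-set samples with noise at a chosen (lower) signal-to-noise ratio can be sketched as follows; the power-ratio-in-dB definition of SNR and the 5 dB target are illustrative assumptions:

```python
import math

def mix_at_snr(speech, noise, snr_db):
    # Scale the noise so that 10*log10(P_speech / P_scaled_noise) == snr_db,
    # then add it to the clean speech sample by sample.
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]

def measure_snr_db(speech, noise):
    p_s = sum(s * s for s in speech) / len(speech)
    p_n = sum(n * n for n in noise) / len(noise)
    return 10 * math.log10(p_s / p_n)

clean = [math.sin(0.1 * t) for t in range(1000)]         # stand-in clean sample
noise = [math.sin(0.37 * t + 1.0) for t in range(1000)]  # stand-in noise
noisy = mix_at_snr(clean, noise, snr_db=5.0)             # second-set sample at 5 dB
```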
After the keyword detection network has been trained, the second voice data sample set, pre-labeled as to whether each sample contains the voice wake-up keyword, can be input into the keyword detection network. According to the output of the keyword detection network, the voice data samples in the second voice data sample set are classified into positive-sample voice data and negative-sample voice data. Positive-sample voice data is voice data correctly identified by the keyword detection network, and negative-sample voice data is voice data misidentified by the keyword detection network. The binary classification network is trained with the positive-sample voice data and the negative-sample voice data.
After training, the binary classification network can optimize the output of the keyword detection network, that is, judge the correctness of the keyword detection network's results, thereby effectively suppressing most false wake-ups while maintaining a high wake-up rate. Moreover, since the binary classification network is a lightweight neural network, it does not introduce excessive system overhead, and the wake-up performance is clearly improved without affecting system performance.
Fig. 5 is a schematic diagram showing a voice wake-up apparatus 500 according to an embodiment of the present disclosure.
The voice wake-up apparatus 500 according to an embodiment of the present disclosure includes a voice data extraction module 501, a first processing module 502, and a second processing module 503. The voice data extraction module 501 obtains the audio feature set of the voice data. The first processing module 502 detects the voice wake-up keyword based on the audio feature set. The second processing module 503 makes a wake-up decision on the audio feature set using a binary classification network in the case where the voice wake-up keyword is detected.
The voice wake-up apparatus 500 further includes a wake-up module 504 for waking the intelligent electronic device in the case where the wake-up decision for the voice data is to wake up.
The binary classification network in the voice wake-up apparatus 500 includes a fully connected neural network. Making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes: in the case where the voice wake-up keyword is detected, activating the fully connected neural network, combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
The voice wake-up apparatus 500 performs voice wake-up keyword detection and the wake-up decision respectively with the cascaded first processing module 502 and second processing module 503. Compared with common voice wake-up technologies, it can achieve a higher wake-up rate and significantly reduce false wake-ups.
Specifically, the first processing module 502 can use the above keyword detection network to detect the voice wake-up keyword. The keyword detection network uses an acoustic model, the association probabilities between it and the input voice data (also called acoustic model posterior probabilities), and the confidence calculation to make the wake-up decision on the input voice.
Optionally, while calculating the acoustic model posterior probabilities and the confidence, the keyword detection network can cache the audio feature data of a fixed window size. When the calculated confidence reaches a specific threshold, the voice wake-up keyword is confirmed as detected. The first processing module 502 can then send the cached fixed-window audio feature data to the second processing module 503.
After receiving the audio feature data sent by the first processing module 502, the second processing module 503 can use the binary classification network to make the wake-up decision.
As described above, the binary classification network in the second processing module 503 can be trained with sample data to which a large amount of music, television, and other noise data has been added. Since the above binary classification network is a lightweight network, it can significantly improve the system's false wake-up performance without bringing excessive additional overhead to the system.
Fig. 6 is a structural diagram showing an intelligent electronic device 600 according to an embodiment of the present disclosure.
Referring to Fig. 6, the intelligent electronic device 600 may include a processor 601, a memory 602, and a voice acquisition unit 604, which can be connected by a bus 603. The intelligent electronic device 600 can be a smart speaker, a smart television, a smart set-top box, a smart phone, or the like.
The processor 601 can execute various actions and processing according to programs stored in the memory 602. Specifically, the processor 601 can be an integrated circuit chip with signal processing capability. The above processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor, or the processor can be any conventional processor, and can be of x86 or ARM architecture.
The memory 602 stores computer instructions that, when executed by the processor 601, implement the above voice wake-up method 200. The memory 602 can be volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The nonvolatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus random access memory (DR RAM). It should be noted that the memory of the methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.
The voice acquisition unit 604 can be a transducer that converts a voice signal into an electrical signal, such as a microphone. The voice acquisition unit 604 can perform the acoustic-to-electric conversion in a variety of forms: electrodynamic (moving coil, ribbon), condenser (DC-polarized), piezoelectric (crystal, ceramic), electromagnetic, carbon, semiconductor, and so on. The electrical signal acquired by the voice acquisition unit can be stored in the memory 602 as a digital file.
The disclosure also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the voice wake-up method 200. Similarly, the computer-readable storage medium in the embodiments of the present disclosure can be volatile memory or nonvolatile memory, or can include both. It should be noted that the computer-readable storage media described herein are intended to include, but are not limited to, these and any other suitable types of memory.
The voice wake-up method, apparatus, computer-readable storage medium, and intelligent electronic device of the embodiments of the present disclosure can solve technical problems in current voice wake-up technology such as heavy computation, high latency, and slow response, and improve the usability of voice wake-up technology.
It should be noted that the flowcharts and block diagrams in the drawings illustrate the possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the disclosure. In this regard, each box in a flowchart or block diagram can represent a module, program segment, or part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes can occur in an order different from that shown in the drawings. For example, two boxes shown in succession can in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented with a dedicated hardware-based system that performs the specified functions or operations, or with a combination of dedicated hardware and computer instructions.
In general, the various example embodiments of the disclosure can be implemented in hardware or special-purpose circuits, software, firmware, logic, or any combination thereof. Some aspects can be implemented in hardware, while other aspects can be implemented in firmware or software executable by a controller, microprocessor, or other computing device. When aspects of embodiments of the disclosure are illustrated or described as block diagrams, flowcharts, or some other graphical representation, it will be understood that the boxes, apparatus, systems, techniques, or methods described herein can be implemented, as non-limiting examples, in hardware, software, firmware, special-purpose circuits or logic, general-purpose hardware or controllers or other computing devices, or some combination thereof.
The example embodiments of the present invention described in detail above are illustrative only and not restrictive. Those skilled in the art should understand that, without departing from the principles and spirit of the present invention, various modifications and combinations can be made to these embodiments or their features, and such modifications shall fall within the scope of the present invention.
Claims (15)
1. A voice wake-up method based on artificial intelligence, comprising:
obtaining an audio feature set of voice data;
detecting a voice wake-up keyword based on the audio feature set; and
in a case where the voice wake-up keyword is detected, making a wake-up decision on the audio feature set using a binary classification network.
2. The voice wake-up method based on artificial intelligence of claim 1, wherein the binary classification network includes a fully connected neural network,
wherein making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes:
in the case where the voice wake-up keyword is detected, activating the fully connected neural network; and
combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
3. The voice wake-up method based on artificial intelligence of claim 1, wherein obtaining the audio feature set of the voice data includes:
obtaining audio feature data of each frame of the voice data; and
caching each obtained frame of audio feature data according to a predetermined caching rule,
wherein the audio feature set includes the audio feature data of multiple consecutive frames.
4. The voice wake-up method based on artificial intelligence of claim 3, wherein the voice wake-up keyword includes multiple phoneme labels, and detecting the voice wake-up keyword based on the audio feature set includes:
using a keyword detection network, comparing each frame of audio feature data in the cache with the phoneme labels of the voice wake-up keyword to determine an association probability between that frame of audio feature data and each phoneme label; and
determining, according to the association probability, a confidence that the voice wake-up keyword is detected in the audio feature set.
5. The voice wake-up method based on artificial intelligence of claim 3, wherein the predetermined caching rule includes at least one of the following:
caching the audio feature data of a predetermined number of consecutive frames according to a first-in-first-out rule; and
caching the audio feature data of a predetermined number of consecutive frames after a predetermined phoneme label is detected.
6. The voice wake-up method based on artificial intelligence of claim 4, wherein the binary classification network is trained after the training of the keyword detection network is completed.
7. The voice wake-up method based on artificial intelligence of claim 6, wherein:
the keyword detection network is trained with a first voice data sample set;
wherein the first voice data sample set has a first average signal-to-noise ratio, and at least a part of the first voice data sample set is voice data containing the voice wake-up keyword.
8. The voice wake-up method based on artificial intelligence of claim 7, wherein:
the binary classification network is trained with a second voice data sample set;
wherein the second voice data sample set has a second average signal-to-noise ratio, and the first average signal-to-noise ratio is higher than the second average signal-to-noise ratio.
9. The voice wake-up method based on artificial intelligence of claim 1, wherein:
in a case where the wake-up decision for the voice data is to wake up, an intelligent electronic device is woken up.
10. A voice wake-up apparatus, comprising:
a voice data extraction module for obtaining an audio feature set of voice data;
a first processing module for detecting a voice wake-up keyword based on the audio feature set; and
a second processing module for making a wake-up decision on the audio feature set using a binary classification network in a case where the voice wake-up keyword is detected.
11. The voice wake-up apparatus of claim 10, further comprising:
a wake-up module for waking an intelligent electronic device in a case where the wake-up decision for the voice data is to wake up.
12. The voice wake-up apparatus of claim 10, wherein the binary classification network includes a fully connected neural network,
wherein making a wake-up decision on the audio feature set using the binary classification network in the case where the voice wake-up keyword is detected includes:
in the case where the voice wake-up keyword is detected, activating the fully connected neural network; and
combining multiple pieces of audio feature data in the audio feature set into representative audio feature data, and making a wake-up decision on the representative audio feature data using the fully connected neural network.
13. An intelligent electronic device, comprising:
a voice acquisition unit for acquiring voice data;
a processor; and
a memory storing computer instructions that, when executed by the processor, implement the method of any one of claims 1-9.
14. The intelligent electronic device of claim 13, wherein the intelligent electronic device is a smart speaker, a smart television, a smart set-top box, or a smart phone.
15. A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the method of any one of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910747867.6A CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910747867.6A CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110364143A true CN110364143A (en) | 2019-10-22 |
CN110364143B CN110364143B (en) | 2022-01-28 |
Family
ID=68224739
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910747867.6A Active CN110364143B (en) | 2019-08-14 | 2019-08-14 | Voice awakening method and device and intelligent electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110364143B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI489372B (en) * | 2013-04-10 | 2015-06-21 | Via Tech Inc | Voice control method and mobile terminal apparatus |
KR101794884B1 (en) * | 2016-03-15 | 2017-12-01 | 재단법인대구경북과학기술원 | Apparatus for drowsy driving prevention using voice recognition load and method thereof |
CN107767861A (en) * | 2016-08-22 | 2018-03-06 | 科大讯飞股份有限公司 | Voice awakening method, system and intelligent terminal |
CN108305617A (en) * | 2018-01-31 | 2018-07-20 | 腾讯科技(深圳)有限公司 | Voice keyword recognition method and device |
CN109461448A (en) * | 2018-12-11 | 2019-03-12 | 百度在线网络技术(北京)有限公司 | Voice interaction method and device |
CN109599124A (en) * | 2018-11-23 | 2019-04-09 | 腾讯科技(深圳)有限公司 | Audio data processing method, device and storage medium |
CN110033758A (en) * | 2019-04-24 | 2019-07-19 | 武汉水象电子科技有限公司 | Voice wake-up implementation method based on a decoding network optimized with a small training set |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110706707A (en) * | 2019-11-13 | 2020-01-17 | 百度在线网络技术(北京)有限公司 | Method, apparatus, device and computer-readable storage medium for voice interaction |
US11393490B2 (en) | 2019-11-13 | 2022-07-19 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, device and computer-readable storage medium for voice interaction |
CN110838289A (en) * | 2019-11-14 | 2020-02-25 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN110838289B (en) * | 2019-11-14 | 2023-08-11 | 腾讯科技(深圳)有限公司 | Wake-up word detection method, device, equipment and medium based on artificial intelligence |
CN112885339A (en) * | 2019-11-14 | 2021-06-01 | 杭州智芯科微电子科技有限公司 | Voice awakening system and voice recognition system |
CN110808030B (en) * | 2019-11-22 | 2021-01-22 | 珠海格力电器股份有限公司 | Voice awakening method, system, storage medium and electronic equipment |
CN110808030A (en) * | 2019-11-22 | 2020-02-18 | 珠海格力电器股份有限公司 | Voice awakening method, system, storage medium and electronic equipment |
CN113192499A (en) * | 2020-01-10 | 2021-07-30 | 青岛海信移动通信技术股份有限公司 | Voice awakening method and terminal |
CN111192590A (en) * | 2020-01-21 | 2020-05-22 | 苏州思必驰信息科技有限公司 | Voice wake-up method, device, equipment and storage medium |
CN111739521A (en) * | 2020-06-19 | 2020-10-02 | 腾讯科技(深圳)有限公司 | Electronic equipment awakening method and device, electronic equipment and storage medium |
CN111653276A (en) * | 2020-06-22 | 2020-09-11 | 四川长虹电器股份有限公司 | Voice awakening system and method |
CN111816193B (en) * | 2020-08-12 | 2020-12-15 | 深圳市友杰智新科技有限公司 | Voice awakening method and device based on multi-segment network and storage medium |
CN111816193A (en) * | 2020-08-12 | 2020-10-23 | 深圳市友杰智新科技有限公司 | Voice awakening method and device based on multi-segment network and storage medium |
CN112233656A (en) * | 2020-10-09 | 2021-01-15 | 安徽讯呼信息科技有限公司 | Artificial intelligent voice awakening method |
CN111933114B (en) * | 2020-10-09 | 2021-02-02 | 深圳市友杰智新科技有限公司 | Training method and use method of voice awakening hybrid model and related equipment |
CN111933114A (en) * | 2020-10-09 | 2020-11-13 | 深圳市友杰智新科技有限公司 | Training method and use method of voice awakening hybrid model and related equipment |
CN112509568A (en) * | 2020-11-26 | 2021-03-16 | 北京华捷艾米科技有限公司 | Voice awakening method and device |
TWI767532B (en) * | 2021-01-22 | 2022-06-11 | 賽微科技股份有限公司 | A wake word recognition training system and training method thereof |
WO2022206602A1 (en) * | 2021-03-31 | 2022-10-06 | 华为技术有限公司 | Speech wakeup method and apparatus, and storage medium and system |
CN115312049A (en) * | 2022-06-30 | 2022-11-08 | 青岛海尔科技有限公司 | Command response method, storage medium and electronic device |
Also Published As
Publication number | Publication date |
---|---|
CN110364143B (en) | 2022-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110364143A (en) | Voice awakening method, device and its intelligent electronic device | |
US11908455B2 (en) | Speech separation model training method and apparatus, storage medium and computer device | |
CN108198547B (en) | Voice endpoint detection method and device, computer equipment and storage medium | |
CN112562691B (en) | Voiceprint recognition method, voiceprint recognition device, computer equipment and storage medium | |
US8930196B2 (en) | System for detecting speech interval and recognizing continuous speech in a noisy environment through real-time recognition of call commands | |
TW201935464A (en) | Method and device for voiceprint recognition based on memorability bottleneck features | |
CN110534099A (en) | Voice wake-up processing method, device, storage medium and electronic equipment |
CN110570873B (en) | Voiceprint wake-up method and device, computer equipment and storage medium | |
CN107492382A (en) | Voiceprint extraction method and device based on neural network |
CN107767863A (en) | Voice awakening method, system and intelligent terminal |
CN107799126A (en) | Voice endpoint detection method and device based on supervised machine learning |
CN103065629A (en) | Speech recognition system of humanoid robot | |
CN108364662B (en) | Voice emotion recognition method and system based on paired identification tasks | |
CN102800316A (en) | Optimal codebook design method for voiceprint recognition system based on nerve network | |
CN113129867B (en) | Training method of voice recognition model, voice recognition method, device and equipment | |
WO2023030235A1 (en) | Target audio output method and system, readable storage medium, and electronic apparatus | |
Mistry et al. | Overview: Speech recognition technology, mel-frequency cepstral coefficients (mfcc), artificial neural network (ann) | |
US11611581B2 (en) | Methods and devices for detecting a spoofing attack | |
CN109215634A (en) | Method and system for a multi-word voice-controlled switch system |
Wen et al. | The application of capsule neural network based cnn for speech emotion recognition | |
CN110853669A (en) | Audio identification method, device and equipment | |
CN110728993A (en) | Voice change identification method and electronic equipment | |
US11475876B2 (en) | Semantic recognition method and semantic recognition device | |
Sahoo et al. | MFCC feature with optimized frequency range: An essential step for emotion recognition | |
Brucal et al. | Female voice recognition using artificial neural networks and MATLAB voicebox toolbox |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||