CN106920558A - Keyword recognition method and device - Google Patents
Keyword recognition method and device
- Publication number
- CN106920558A CN106920558A CN201510993729.8A CN201510993729A CN106920558A CN 106920558 A CN106920558 A CN 106920558A CN 201510993729 A CN201510993729 A CN 201510993729A CN 106920558 A CN106920558 A CN 106920558A
- Authority
- CN
- China
- Prior art keywords
- to-be-identified
- voice data
- median
- keyword
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 30
- 238000001228 spectrum Methods 0.000 claims abstract description 50
- 230000009466 transformation Effects 0.000 claims abstract description 9
- 230000003068 static effect Effects 0.000 claims description 8
- 230000001052 transient effect Effects 0.000 claims description 8
- 238000012545 processing Methods 0.000 claims description 6
- 238000006243 chemical reaction Methods 0.000 claims description 4
- 230000001960 triggered effect Effects 0.000 claims description 3
- 230000008859 change Effects 0.000 description 5
- 230000008569 process Effects 0.000 description 5
- 238000009432 framing Methods 0.000 description 3
- 238000001514 detection method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000002045 lasting effect Effects 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/16—Hidden Markov models [HMM]
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
A keyword recognition method and device. The method includes: dividing acquired to-be-identified voice data into multiple overlapping sound frames; performing a fast Fourier transform on the voice signal of each of the sound frames to obtain the corresponding spectrum energy; converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and computing the corresponding MFCC parameters; according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates; and, when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result. The above scheme improves the accuracy of keyword recognition while saving computing resources.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a keyword recognition method and device.
Background technology
Speech recognition is the technology by which a machine converts human speech into corresponding text or commands through a process of identification and understanding. As an important branch of the speech-recognition field, keyword recognition (Isolated Word Recognition, IWR) is widely applied in areas such as communications, consumer electronics, self-service, and office automation.
In the prior art, keyword recognition is typically performed with a hidden Markov model (Hidden Markov Model, HMM) and its corresponding parameters, or with a keyword spotting system (KWS). However, such prior-art keyword recognition methods must build a corresponding model and train its parameters through corresponding computation, and therefore suffer from a heavy computational load and low recognition accuracy.
Summary of the invention
The problem solved by embodiments of the present invention is to improve the accuracy of keyword recognition while saving computing resources.
To solve the above problem, an embodiment of the present invention provides a keyword recognition method, the keyword recognition method including:
dividing acquired to-be-identified voice data into multiple overlapping sound frames;
performing a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy;
converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters;
according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates;
when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
Optionally, the converting of each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters are performed when the spectrum energy of the to-be-identified voice data exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the to-be-identified voice data.
Optionally, the noise level of the to-be-identified voice data is one of a low noise level, a medium noise level, and a high noise level, where:
when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold;
when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the to-be-identified voice data is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
An embodiment of the present invention further provides a keyword identification device, the device including:
a framing unit, adapted to divide acquired to-be-identified voice data into multiple overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy;
a first computing unit, adapted to convert each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and compute the corresponding MFCC parameters;
a second computing unit, adapted to compute, according to the MFCC parameters of each sound frame, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates;
a judging unit, adapted to judge whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is below a preset threshold;
a keyword recognition unit, adapted to take the keyword of the current reference template as the recognition result when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below the preset threshold.
Optionally, the device further includes a trigger unit, the trigger unit being adapted to trigger the first computing unit to perform the converting of each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters when the spectrum energy of the to-be-identified voice data exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the to-be-identified voice data.
Optionally, the noise level of the to-be-identified voice data is one of a low noise level, a medium noise level, and a high noise level, where:
when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold;
when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the to-be-identified voice data is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
Compared with the prior art, the technical scheme of the present invention has the following advantages:
In the above scheme, whether a sound frame contains a keyword is decided by comparing, against a preset threshold, the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median computed, from the corresponding MFCC parameters, between the to-be-identified voice data and a reference template. No mathematical recognition model needs to be built and no keyword-specific training is required, so the scheme saves the computing resources of keyword recognition and improves its accuracy.
Further, keyword recognition is performed on to-be-identified voice data only when its spectrum energy exceeds a preset energy threshold; otherwise, no keyword recognition is performed on that data. This further saves computing resources and increases the speed of keyword recognition.
Further, when a reference template is recorded, the reference template includes information on transient noise, static noise, and rich voice content of the particular speaker, so that the template records the speaker's voice and its acoustic environment relatively accurately, further improving the accuracy of keyword recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a keyword recognition method in an embodiment of the present invention;
Fig. 2 is a flowchart of another keyword recognition method in an embodiment of the present invention;
Fig. 3 is a structural diagram of a keyword identification device in an embodiment of the present invention.
Detailed description of the embodiments
To solve the above problems in the prior art, the technical scheme adopted by embodiments of the present invention decides whether a sound frame contains a keyword by comparing, against a preset threshold, the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and a reference template. This saves the computing resources of keyword recognition and improves its accuracy.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in Fig. 1 may include the following steps:
Step S101: dividing acquired to-be-identified voice data into multiple overlapping sound frames.
In a specific implementation, the size of the overlap between sound frames can be configured according to actual needs. For example, when each sound frame is 32 ms long, the overlap between adjacent sound frames may be 16 ms.
Step S102: performing a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy.
In a specific implementation, the sound frames obtained by the dividing are time-domain voice signals; a fast Fourier transform (FFT) converts each time-domain voice signal into a frequency-domain voice signal.
Step S103: converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters.
In a specific implementation, the spectrum energy (power spectrum) of the voice signal obtained by the fast Fourier transform can be converted, according to a preset mapping, to spectrum energy on the mel-frequency scale, and the mel-frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC) parameters of each sound frame are computed from the spectrum energy on the mel-frequency scale.
Step S104: according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates.
In a specific implementation, each of the preset reference templates contains the voice content of a corresponding keyword. The number of preset reference templates can be configured according to actual needs, and the present invention places no limit on it.
Step S105: when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
In a specific implementation, the preset reference templates are traversed. For each template, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the current reference template are computed, and the mean of the three is compared with the preset threshold. When the mean is determined to be below the preset threshold, the keyword of the current reference template can be taken as the recognition result; otherwise, the current to-be-identified voice data is determined not to contain the voice information of the keyword of the current reference template.
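The traversal-and-threshold decision of steps S104-S105 can be sketched as follows. The template structure (a dict with `keyword` and `feats` keys) and the injected `dist_medians` function are illustrative assumptions; the patent only fixes the rule that the mean of the three medians must fall below the preset threshold.

```python
def recognize(audio_feats, templates, dist_medians, threshold):
    """Return the keyword of the first reference template whose mean of the
    three distance medians (DTW, Euclidean, cross-correlation) is below the
    preset threshold, or None when no template matches."""
    for tpl in templates:
        dtw_m, ed_m, cc_m = dist_medians(audio_feats, tpl["feats"])
        if (dtw_m + ed_m + cc_m) / 3.0 < threshold:
            return tpl["keyword"]
    return None  # no keyword recognised in this clip
```

Whether to stop at the first match or pick the template with the smallest mean is a design choice the text leaves open; the sketch stops at the first match, mirroring the traversal described above.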
The keyword recognition method in the embodiment of the present invention is described in further detail below with reference to Fig. 2.
Fig. 2 shows a flowchart of another keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in Fig. 2 may include the following steps:
Step S201: dividing acquired voice data into overlapping frames to obtain multiple sound frames.
In a specific implementation, the collected voice signal may first be analog-to-digital converted to obtain the corresponding voice data. The voice data is then divided into overlapping frames to obtain multiple sound frames. Framing the collected voice data is essentially short-time analysis of the voice data: the voice signal is divided into short segments of fixed duration, each of which is a relatively stationary sound clip. Adjacent sound frames partially overlap, and the overlap can be chosen according to the actual situation.
Step S202: applying a window to each of the resulting sound frames.
In a specific implementation, window functions commonly used in speech processing, such as the Hamming window, Hanning window, or rectangular window, can be selected, with a frame length of 10-40 ms and a typical value of 20 ms. Framing destroys the naturalness of the voice signal; applying windowing and overlap to the sound frames mitigates this problem.
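The windowing of step S202 amounts to an element-wise multiply of each frame by one of the window functions named above; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def window_frames(frames, kind="hamming"):
    """Multiply every frame by an analysis window (Hamming, Hanning, or
    rectangular) to soften the frame edges before the FFT."""
    n = frames.shape[1]
    win = {"hamming": np.hamming, "hanning": np.hanning, "rect": np.ones}[kind](n)
    return frames * win
```

The Hamming window, for instance, tapers each frame from 0.08 at the edges to 1.0 at the centre, reducing the spectral leakage that abrupt frame boundaries would otherwise cause.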
Step S203: performing a fast Fourier transform on each windowed sound frame to obtain the information of the corresponding spectrum energy of each sound frame.
In a specific implementation, voice data is in theory a non-stationary process that changes over time, so it cannot be converted to the frequency domain directly. However, because the voice data has been framed (short-time analysis), each frame of voice data can be regarded as quasi-stationary and is therefore suitable for frequency-domain conversion.
In a specific implementation, a short-time Fourier transform (Short-Time Fourier Transform / Short-Term Fourier Transform, STFT) can be used to convert each frame of voice data to the frequency domain, obtaining the corresponding spectrum information of each sound frame. The resulting spectrum describes the relation between the frequency and the energy of the corresponding voice signal.
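The per-frame spectrum energy of step S203 can be sketched as the squared magnitude of each frame's real FFT; normalising by the frame length is one common convention (an assumption here, since the patent does not fix a normalisation):

```python
import numpy as np

def spectrum_energy(frames):
    """Per-frame power spectrum: real FFT of each (windowed) time-domain
    frame, magnitude squared, normalised by the frame length."""
    spec = np.fft.rfft(frames, axis=1)
    return (spec.real ** 2 + spec.imag ** 2) / frames.shape[1]
```

For a frame of N samples the real FFT yields N//2 + 1 frequency bins, so each sound frame is mapped to one row of bin energies.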
Step S204: converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters.
In an embodiment of the present invention, after the spectrum energies of the multiple sound frames of the current to-be-identified voice data are obtained, it may first be judged whether the spectrum energy of the current to-be-identified voice data exceeds a preset energy threshold. When the spectrum energy of the current to-be-identified voice data is determined to exceed the energy threshold, step S204 continues; otherwise, the current to-be-identified voice data is determined not to contain the voice information of a keyword, and subsequent processing of it can stop, further saving computing resources.
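The energy gate just described reduces to a single comparison; a minimal sketch, assuming the gate sums the per-frame energies (the patent does not say exactly how the clip-level energy is aggregated):

```python
def should_process(frame_energies, energy_threshold):
    """Energy gate from step S204: run MFCC extraction and template matching
    only when the clip's total spectral energy exceeds the preset threshold.
    The threshold value itself is application-dependent."""
    return float(sum(frame_energies)) > energy_threshold
```

Clips failing the gate are treated as containing no keyword and skipped outright, which is where the computing-resource saving comes from.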
In a specific implementation, the spectrum energy obtained by the FFT can be converted, according to a preset mapping, to spectrum energy on the mel (Mel) frequency scale, and the MFCC parameters of each sound frame are computed as that sound frame's feature vector.
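The mel conversion and MFCC computation can be sketched with a standard triangular mel filterbank, a log, and a type-II DCT. The filter count, coefficient count, and floor constant below are illustrative defaults, not values from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spec, sample_rate, n_filters=20, n_coeffs=12):
    """Map each frame's power spectrum onto a triangular mel filterbank,
    then take the log and a DCT to get one MFCC vector per frame."""
    n_bins = power_spec.shape[1]
    mel_pts = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):          # build triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    mel_energy = np.log(power_spec @ fbank.T + 1e-10)
    # type-II DCT of the log filterbank energies, keeping the first n_coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return mel_energy @ dct.T
```

Each row of the result is the feature vector of one sound frame, which is what the distance computations of step S205 consume.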
Step S205: according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the current one of the preset reference templates.
In an embodiment of the present invention, when computing the DTW distance between the current to-be-identified voice data and a reference template, both are divided into I frames. The inventors observed from experience that, during the recording of a reference template, the speaker's pronunciation tends to become excited and the speaking rate is slower than usual. The reference template is therefore divided into I frames, and the DTW distance is computed with a hop size of 0.1·I frames. After the DTW distances between the I frames of the current to-be-identified voice data and the I frames of the reference template are computed, the median of the I DTW distances is taken as the DTW distance median between the current to-be-identified voice data and the corresponding reference template. Similarly, the Euclidean distance (ED) median and the cross-correlation distance (CC) median between the current to-be-identified voice data and the corresponding reference template can be obtained.
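One way to realise the three distance medians of step S205 is sketched below. This is a simplification under stated assumptions: `dtw_distance` is classic dynamic-programming DTW over MFCC vectors, the 0.1·I hop is applied as a sliding start offset, and cross-correlation distance is taken as one minus the normalised dot product; the patent's exact windowing is not fully specified.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two feature sequences
    (frames x coeffs); returns the accumulated warping-path cost."""
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J]

def distance_medians(a, b):
    """Medians of DTW, Euclidean, and cross-correlation distances between
    to-be-identified features a and template features b (sketch only)."""
    I = min(len(a), len(b))
    ed = [float(np.linalg.norm(a[i] - b[i])) for i in range(I)]
    cc = [1.0 - float(np.dot(a[i], b[i]) /
          (np.linalg.norm(a[i]) * np.linalg.norm(b[i]) + 1e-10)) for i in range(I)]
    hop = max(1, int(0.1 * I))          # hop of 0.1*I frames, as in the text
    dtw = [dtw_distance(a[s:s + I], b) for s in range(0, I, hop)]
    return float(np.median(dtw)), float(np.median(ed)), float(np.median(cc))
```

Medians rather than means are used so that a few badly-matching frames (e.g. a transient) do not dominate any one distance.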
Step S206: judging whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is below the preset threshold. When the judgment result is yes, step S207 can be performed; otherwise, the procedure repeats from step S205 with the next of the preset reference templates.
In a specific implementation, after the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the reference template are computed, the mean of the three is compared with the preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current to-be-identified voice data; that is, different noise levels correspond to different preset thresholds. Specifically, when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold; when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2; when p < p2, the to-be-identified voice data is determined to have a high noise level. In an embodiment of the present invention, p1 is 0.8 and p2 is 0.45.
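The noise-level classification and its threshold lookup can be sketched as below. The p1/p2 boundaries come from the text; the per-level decision thresholds in `threshold_for` are illustrative placeholders, since the patent states only that the threshold depends on noise level without giving values.

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify a clip by its absolute-amplitude statistic p (p1 > p2):
    a strong, clean signal (large p) is treated as low-noise."""
    if p >= p1:
        return "low"
    if p >= p2:          # i.e. p1 > p >= p2
        return "medium"
    return "high"

def threshold_for(level):
    """Illustrative per-level decision thresholds (assumed values)."""
    return {"low": 0.30, "medium": 0.45, "high": 0.60}[level]
```

A noisier clip thus gets a looser decision threshold, keeping the match rate usable as conditions degrade.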
Step S207: taking the keyword of the current reference template as the recognition result and outputting it.
In a specific implementation, when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between some one of the preset reference templates and the current to-be-identified voice data is determined to be below the preset threshold, it can be determined that the current to-be-identified voice data contains the voice information of that template's keyword. The keyword of that reference template can therefore be output as the keyword recognition result of the current to-be-identified voice data.
In a specific implementation, when the above keyword recognition method is applied in an alarm system, the alarm system can perform an alarm operation upon recognizing a corresponding keyword.
It should be pointed out that, in emergency or other keyword applications, an untrained user may record personalized keywords. To guarantee good recognition performance, the reference template becomes extremely important, and the recording quality of the reference template can be ensured by a simple verification operation.
The inventors therefore propose three detection factors: a transient noise source (such as a door slam), a static noise source (such as a fan or traffic noise), and rich keyword pronunciation content. All three factors must be satisfied simultaneously; otherwise, the keyword must be recorded again. For the detection of transient noise, the difference in absolute amplitude of the energy of the voice signal can be used over consecutive 25 ms sound frames with a hop size of 5 ms, where the absolute amplitudes of every 5 sound frames can be averaged. For static-noise detection, the keyword is recorded in a quiet environment within a preset 5 s time window; compared with the voice data containing the keyword, the signal energy at the beginning and end of the reference template within the 5 s window, which does not contain the keyword, differs markedly. For verifying rich pronunciation content, keywords consisting only of a single vowel without consonants (such as "ah") are rejected; this rejection can be made on the basis of a modified zero-crossing rate related to the keyword's pronunciation content.
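Two of the three template-quality checks can be sketched as follows. The 25 ms / 5 ms frame and hop sizes follow the text; the `jump_ratio` tuning value is an assumption, and the plain zero-crossing rate below is a simplification of the patent's unspecified "modified" zero-crossing rate.

```python
import numpy as np

def has_transient(samples, sample_rate, jump_ratio=5.0):
    """Transient check: short-frame energies (25 ms frames, 5 ms hop); a
    frame whose energy jumps far above the clip average suggests a
    transient such as a door slam. jump_ratio is an assumed tuning value."""
    frame, hop = int(0.025 * sample_rate), int(0.005 * sample_rate)
    energies = [float(np.mean(samples[i:i + frame] ** 2))
                for i in range(0, len(samples) - frame, hop)]
    return bool(max(energies) / (np.mean(energies) + 1e-12) > jump_ratio)

def zero_crossing_rate(samples):
    """Plain zero-crossing rate, used here as a stand-in for the
    content-richness check (single-vowel keywords score low)."""
    return float(np.mean(np.abs(np.diff(np.sign(samples))) > 0))

t = np.linspace(0, 1, 8000, endpoint=False)
tone = np.sin(2 * np.pi * 100 * t + 0.1)      # steady 100 Hz tone
click = tone.copy()
click[4000:4040] += 20.0                      # inject a short loud click
```

A recording that trips the transient check, shows inconsistent start/end energy, or scores too low on the content check would be rejected and re-recorded, as the text requires.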
The device corresponding to the keyword recognition method in the embodiment of the present invention is described in further detail below.
Referring to Fig. 3, the keyword identification device 300 in the embodiment of the present invention can include a framing unit 301, a frequency-domain conversion unit 302, a first computing unit 303, a second computing unit 304, a judging unit 305, and a keyword recognition unit 306, wherein:
the framing unit 301 is adapted to divide acquired to-be-identified voice data into multiple overlapping sound frames;
the frequency-domain conversion unit 302 is adapted to traverse the multiple sound frames obtained by the dividing and perform a fast Fourier transform on the voice signal of the current sound frame being traversed, to obtain the corresponding spectrum energy;
the first computing unit 303 is adapted to convert the resulting spectrum energy to spectrum energy on the mel-frequency scale, and compute the corresponding MFCC parameters;
in a specific implementation, a trigger unit (not shown) can also be provided in the keyword identification device 300, the trigger unit being adapted to trigger the first computing unit 303 to perform the converting of the resulting spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters when the spectrum energy of the current sound frame being traversed exceeds a preset energy threshold;
the second computing unit 304 is adapted to compute, according to the MFCC parameters of the current sound frame, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and each of the preset reference templates;
the judging unit 305 is adapted to judge whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and the reference template is below a preset threshold;
in a specific implementation, the preset threshold is associated with the noise level of the current sound frame, where: when p ≥ p1, the current sound frame is determined to have a low noise level, p being the absolute amplitude of the current sound frame and p1 a preset first threshold; when p1 > p ≥ p2, the current sound frame is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2; when p < p2, the current sound frame is determined to have a high noise level. In an embodiment of the present invention, p1 equals 0.8 and p2 equals 0.45.
In a specific implementation, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
The keyword recognition unit 306 is adapted to take the keyword of the current reference template as the recognition result and output it when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and the reference template is determined to be below the preset threshold.
One of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium; the storage medium can include a ROM, a RAM, a magnetic disk, an optical disc, and the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; the protection scope of the present invention should therefore be defined by the scope of the claims.
Claims (12)
1. A keyword recognition method, characterized in that it comprises:
dividing acquired audio data to be identified into a plurality of overlapping sound frames;
performing a fast Fourier transform on the signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale, and computing the corresponding MFCC parameters;
computing, from the MFCC parameters of each sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and each of a plurality of preset reference templates; and
when the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and a current reference template is determined to be less than a preset threshold, taking the keyword of the current reference template as the recognition result.
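The first two steps of claim 1 (overlapping framing and FFT-based spectral energy) can be sketched as follows; the frame length, hop size, and window are illustrative assumptions, since the patent does not specify them:

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=160):
    """Split the audio into overlapping frames (25 ms frames with a
    10 ms hop at 16 kHz; sizes are illustrative, not from the patent)."""
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    return np.stack([audio[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def spectral_energy(frames):
    """Per-frame spectral energy via FFT: sum of |X(k)|^2 over bins."""
    spectrum = np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1)
    return np.sum(np.abs(spectrum) ** 2, axis=1)

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
frames = frame_signal(audio)
energies = spectral_energy(frames)
```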
2. The keyword recognition method according to claim 1, characterized in that, when the spectral energy of the audio data to be identified is greater than a preset energy threshold, the operation of converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale and computing the corresponding MFCC parameters is performed.
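The energy gate of claim 2 might look like the following sketch; the threshold value is a placeholder, since the patent leaves the preset energy threshold unspecified, and taking the maximum per-frame energy as "the" spectral energy is an assumption:

```python
import numpy as np

ENERGY_THRESHOLD = 1000.0  # hypothetical preset value, not from the patent

def passes_energy_gate(spectral_energies, threshold=ENERGY_THRESHOLD):
    """Proceed to the Mel conversion / MFCC stage only when the spectral
    energy of the audio to be identified exceeds the preset threshold."""
    return bool(np.max(spectral_energies) > threshold)
```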
3. The keyword recognition method according to claim 1, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
4. The keyword recognition method according to claim 3, characterized in that the noise level of the audio data to be identified comprises a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; and
when p < p2, the audio data to be identified is determined to have a high noise level.
5. The keyword recognition method according to claim 4, characterized in that p1 equals 0.8 and p2 equals 0.45.
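The three-way amplitude test of claims 4 and 5 can be written directly; note that the medium band is read here as p1 > p ≥ p2, which is the only ordering consistent with the stated constraint p1 > p2:

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify the noise level of the audio to be identified from its
    absolute amplitude p, using the claim-5 thresholds p1=0.8, p2=0.45."""
    if p >= p1:
        return "low"
    if p >= p2:  # i.e. p1 > p >= p2
        return "medium"
    return "high"
```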
6. The keyword recognition method according to claim 1, characterized in that the reference templates include transient noise, static noise, and information-rich speech content from a particular speaker.
7. A keyword recognition apparatus, characterized in that it comprises:
a framing unit, adapted to divide acquired audio data to be identified into a plurality of overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
a first computing unit, adapted to convert the spectral energy of each sound frame into spectral energy on the Mel-frequency scale, and to compute the corresponding MFCC parameters;
a second computing unit, adapted to compute, from the MFCC parameters of each sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and a current reference template is less than a preset threshold; and
a keyword recognition unit, adapted to take the keyword of the current reference template as the recognition result when the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and the current reference template is determined to be less than the preset threshold.
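The second computing unit measures, among other things, a DTW distance from the input to each reference template. A plain dynamic-programming sketch over MFCC frame sequences (rows = frames) is shown below; this is the textbook algorithm, not necessarily the patent's implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two MFCC frame sequences a and b.
    O(len(a) * len(b)) cost matrix with steps (i-1,j), (i,j-1), (i-1,j-1);
    per-cell cost is the Euclidean distance between frame vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])
```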
8. The keyword recognition apparatus according to claim 7, characterized by further comprising a trigger unit, the trigger unit being adapted to trigger, when the spectral energy of the audio data to be identified is greater than a preset energy threshold, the first computing unit to perform the operation of converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale and computing the corresponding MFCC parameters.
9. The keyword recognition apparatus according to claim 7, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
10. The keyword recognition apparatus according to claim 9, characterized in that the noise level of the audio data to be identified comprises a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; and
when p < p2, the audio data to be identified is determined to have a high noise level.
11. The keyword recognition apparatus according to claim 10, characterized in that p1 equals 0.8 and p2 equals 0.45.
12. The keyword recognition apparatus according to claim 7, characterized in that the reference templates include transient noise, static noise, and information-rich speech content from a particular speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106920558A true CN106920558A (en) | 2017-07-04 |
CN106920558B CN106920558B (en) | 2021-04-13 |
Family
ID=59454658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510993729.8A Active CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106920558B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080037837A1 (en) * | 2004-05-21 | 2008-02-14 | Yoshihiro Noguchi | Behavior Content Classification Device |
CN101222703A (en) * | 2007-01-12 | 2008-07-16 | 杭州波导软件有限公司 | Identity verification method for mobile terminal based on voice identification |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN102509547A (en) * | 2011-12-29 | 2012-06-20 | 辽宁工业大学 | Method and system for voiceprint recognition based on vector quantization based |
CN102687100A (en) * | 2010-01-06 | 2012-09-19 | 高通股份有限公司 | User interface methods and systems for providing force-sensitive input |
CN103021409A (en) * | 2012-11-13 | 2013-04-03 | 安徽科大讯飞信息科技股份有限公司 | Voice activating photographing system |
CN103065627A (en) * | 2012-12-17 | 2013-04-24 | 中南大学 | Identification method for horn of special vehicle based on dynamic time warping (DTW) and hidden markov model (HMM) evidence integration |
CN103854645A (en) * | 2014-03-05 | 2014-06-11 | 东南大学 | Speech emotion recognition method based on punishment of speaker and independent of speaker |
CN103971678A (en) * | 2013-01-29 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords |
CN104103272A (en) * | 2014-07-15 | 2014-10-15 | 无锡中星微电子有限公司 | Voice recognition method and device and blue-tooth earphone |
CN104103280A (en) * | 2014-07-15 | 2014-10-15 | 无锡中星微电子有限公司 | Dynamic time warping algorithm based voice activity detection method and device |
CN104778951A (en) * | 2015-04-07 | 2015-07-15 | 华为技术有限公司 | Speech enhancement method and device |
CN104978507A (en) * | 2014-04-14 | 2015-10-14 | 中国石油化工集团公司 | Intelligent well logging evaluation expert system identity authentication method based on voiceprint recognition |
Non-Patent Citations (5)
Title |
---|
ABHIJEET KUMAR: "Voice Command Recognition system based on MFCC and DTW", International Journal of Engineering Science and Technology *
LINDASALWA: "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques", Journal of Computing *
刘志镜: "Automatic Gait Recognition Based on Weighted DTW Distance", Journal of Image and Graphics *
吴康妍: "A DTW Score-Following Algorithm Combining Endpoint Detection and Error Detection", Computer Applications and Software *
赵晓慧: "Research on Dynamic Fuzzy Clustering of Time Series", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065043A (en) * | 2018-08-21 | 2018-12-21 | 广州市保伦电子有限公司 | A kind of order word recognition method and computer storage medium |
CN109065043B (en) * | 2018-08-21 | 2022-07-05 | 广州市保伦电子有限公司 | Command word recognition method and computer storage medium |
CN112765335A (en) * | 2021-01-27 | 2021-05-07 | 上海三菱电梯有限公司 | Voice calling landing system |
CN112765335B (en) * | 2021-01-27 | 2024-03-08 | 上海三菱电梯有限公司 | Voice call system |
Also Published As
Publication number | Publication date |
---|---|
CN106920558B (en) | 2021-04-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101988222B1 (en) | Apparatus and method for large vocabulary continuous speech recognition | |
KR102134201B1 (en) | Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition | |
KR100679051B1 (en) | Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms | |
US8271283B2 (en) | Method and apparatus for recognizing speech by measuring confidence levels of respective frames | |
Mantena et al. | Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping | |
Shahnawazuddin et al. | Pitch-Adaptive Front-End Features for Robust Children's ASR. | |
US20090313016A1 (en) | System and Method for Detecting Repeated Patterns in Dialog Systems | |
US20100161330A1 (en) | Speech models generated using competitive training, asymmetric training, and data boosting | |
Mouaz et al. | Speech recognition of moroccan dialect using hidden Markov models | |
US7177810B2 (en) | Method and apparatus for performing prosody-based endpointing of a speech signal | |
Vyas | A Gaussian mixture model based speech recognition system using Matlab | |
CN104103280B (en) | The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
Abdo et al. | Automatic detection for some common pronunciation mistakes applied to chosen Quran sounds | |
Zehetner et al. | Wake-up-word spotting for mobile systems | |
Chadha et al. | Optimal feature extraction and selection techniques for speech processing: A review | |
CN106920558A (en) | Keyword recognition method and device | |
Zolnay et al. | Extraction methods of voicing feature for robust speech recognition. | |
Jung et al. | Selecting feature frames for automatic speaker recognition using mutual information | |
Yavuz et al. | A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model. | |
Li et al. | Voice-based recognition system for non-semantics information by language and gender | |
Sharma et al. | Speech recognition of Punjabi numerals using synergic HMM and DTW approach | |
Ishizuka et al. | A feature for voice activity detection derived from speech analysis with the exponential autoregressive model | |
Rahman et al. | Continuous bangla speech segmentation, classification and feature extraction | |
JP4576612B2 (en) | Speech recognition method and speech recognition apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||