CN106920558A - Keyword recognition method and device - Google Patents

Keyword recognition method and device

Info

Publication number
CN106920558A
CN106920558A (application CN201510993729.8A)
Authority
CN
China
Prior art keywords
identified
voice data
intermediate value
keyword
default
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510993729.8A
Other languages
Chinese (zh)
Other versions
CN106920558B (en)
Inventor
孙廷玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Spreadtrum Communications Inc
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN201510993729.8A priority Critical patent/CN106920558B/en
Publication of CN106920558A publication Critical patent/CN106920558A/en
Application granted granted Critical
Publication of CN106920558B publication Critical patent/CN106920558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building
    • G10L17/16: Hidden Markov models [HMM]

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A keyword recognition method and device. The method includes: dividing the obtained audio data to be recognized into a plurality of overlapping sound frames; performing a fast Fourier transform on the speech signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy; converting the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and computing the corresponding MFCC parameters; based on the MFCC parameters of each sound frame, computing the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and each of a plurality of preset reference templates; and, when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result. The above scheme improves the accuracy of keyword recognition and saves computing resources.

Description

Keyword recognition method and device
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a keyword recognition method and device.
Background technology
Speech recognition is the technology by which a machine, through a process of identification and understanding, converts human speech into corresponding text or commands. As an important branch of the speech recognition field, keyword recognition (Isolated Word Recognition, IWR) has been widely applied in areas such as communications, consumer electronics, self-service, and office automation.
In the prior art, keyword recognition is typically performed using hidden Markov models (Hidden Markov Model, HMM) together with their corresponding parameters, or using a keyword spotting system (KWS).
However, prior-art keyword recognition methods must build a corresponding model and train its parameters through a corresponding training procedure, and therefore suffer from heavy computation and low recognition accuracy.
Summary of the invention
The problem addressed by embodiments of the present invention is how to improve the accuracy of keyword recognition while saving computing resources.
To solve the above problem, an embodiment of the present invention provides a keyword recognition method, the keyword recognition method including:
dividing the obtained audio data to be recognized into a plurality of overlapping sound frames;
performing a fast Fourier transform on the speech signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
converting the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and computing the corresponding MFCC parameters;
based on the MFCC parameters of each sound frame, computing the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and each of a plurality of preset reference templates;
when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
Optionally, the conversion of each sound frame's spectral energy to mel-frequency spectral energy and the computation of the corresponding MFCC parameters are performed when the spectral energy of the audio data to be recognized exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the audio data to be recognized.
Optionally, the noise level of the audio data to be recognized is one of a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be recognized is determined to have a low noise level, p denoting the absolute amplitude of the audio data to be recognized and p1 being a preset first threshold;
when p1 > p ≥ p2, the audio data to be recognized is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the audio data to be recognized is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and the rich speech content of a particular speaker.
An embodiment of the present invention further provides a keyword recognition device, the device including:
a framing unit, adapted to divide the obtained audio data to be recognized into a plurality of overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the speech signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
a first computation unit, adapted to convert the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and to compute the corresponding MFCC parameters;
a second computation unit, adapted to compute, based on the MFCC parameters of each sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the current sound frame and the current reference template is below a preset threshold;
a keyword recognition unit, adapted to take the keyword of the current reference template as the recognition result when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is determined to be below the preset threshold.
Optionally, the device further includes a trigger unit, adapted to trigger the first computation unit to perform the conversion of each sound frame's spectral energy to mel-frequency spectral energy and the computation of the corresponding MFCC parameters when the spectral energy of the audio data to be recognized exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the audio data to be recognized.
Optionally, the noise level of the audio data to be recognized is one of a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be recognized is determined to have a low noise level, p denoting the absolute amplitude of the audio data to be recognized and p1 being a preset first threshold;
when p1 > p ≥ p2, the audio data to be recognized is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the audio data to be recognized is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and the rich speech content of a particular speaker.
Compared with the prior art, the technical scheme of the present invention has the following advantages:
In the above scheme, whether a sound frame contains a keyword is determined by comparing, against a preset threshold, the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and the reference template, all computed from the corresponding MFCC parameters. No mathematical recognition model needs to be built, and no keyword-model training is required; the scheme therefore saves the computing resources of keyword recognition and improves its accuracy.
Further, keyword recognition is performed on the audio data to be recognized only when its spectral energy exceeds a preset energy threshold; otherwise no keyword recognition is performed on it. This further saves computing resources and increases the speed of keyword recognition.
Further, when the corresponding reference template is recorded, it includes information on transient noise, static noise, and the rich speech content of a particular speaker, so that the reference template relatively accurately captures both the speech of that speaker and the environment in which the speech was recorded, further improving the accuracy of keyword recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a keyword recognition method in an embodiment of the present invention;
Fig. 2 is a flowchart of another keyword recognition method in an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a keyword recognition device in an embodiment of the present invention.
Specific embodiment
To solve the above-mentioned problems in the prior art, the technical scheme that the embodiment of the present invention is used passes through It is determined that DTW between voice data to be identified and reference template is apart from intermediate value, Euclidean distance intermediate value and mutually Whether the average of correlation distance intermediate value is compared to determine include key in voiced frame with default threshold value Word, can save the computing resource of keyword identification, it is possible to improve the accuracy rate of keyword identification.
It is understandable to enable the above objects, features and advantages of the present invention to become apparent, below in conjunction with the accompanying drawings Specific embodiment of the invention is described in detail.
Fig. 1 shows a kind of flow chart of the keyword recognition method in the embodiment of the present invention.Such as Fig. 1 institutes The keyword recognition method shown, may include steps of:
Step S101: divide the obtained audio data to be recognized into a plurality of overlapping sound frames.
In a specific implementation, the size of the overlap between sound frames can be set according to actual needs. For example, when each sound frame is 32 ms long, the overlap between adjacent sound frames can be 16 ms.
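As an illustrative sketch (not part of the patent disclosure), the overlapping framing of step S101 could look as follows, assuming 16 kHz audio so that 512 samples correspond to a 32 ms frame and a 256-sample hop to the 16 ms overlap:

```python
import numpy as np

def frame_signal(audio, frame_len=512, hop_len=256):
    """Split a 1-D signal into overlapping frames.

    At 16 kHz, frame_len=512 gives the 32 ms frames and
    hop_len=256 gives the 16 ms overlap described above.
    """
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop_len)
    return np.stack([audio[i * hop_len: i * hop_len + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.arange(2048, dtype=float))
print(frames.shape)  # → (7, 512)
```

The second half of each frame reappears as the first half of the next, which is exactly the overlap the framing step requires.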
Step S102: perform a fast Fourier transform on the speech signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy.
In a specific implementation, the sound frames obtained by the division are time-domain speech signals; a fast Fourier transform (FFT) converts the time-domain speech signal into a frequency-domain speech signal.
Step S103: convert the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and compute the corresponding MFCC parameters.
In a specific implementation, the spectral energy (power spectrum) of the speech signal obtained by the fast Fourier transform can be converted, according to a preset mapping, to spectral energy on the mel-frequency scale, from which the mel-frequency cepstral coefficient (Mel Frequency Cepstrum Coefficient, MFCC) parameters of each sound frame are computed.
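A minimal sketch of step S103 is given below; the sample rate, filterbank size, and number of coefficients are illustrative assumptions, since the patent does not specify them:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_mels=26, n_ceps=13):
    """MFCC parameters of one sound frame: FFT power spectrum,
    triangular mel filterbank, log, then DCT-II."""
    n_fft = len(frame)
    spec = np.abs(np.fft.rfft(frame)) ** 2               # step S102 output
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):                        # triangular filters
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    log_mel = np.log(fbank @ spec + 1e-10)                # mel-scale energies
    idx = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * idx + 1)
                 / (2 * n_mels))                          # DCT-II basis
    return dct @ log_mel
```

The returned vector serves as the frame's feature vector in the subsequent distance computations.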
Step S104: based on the MFCC parameters of each sound frame, compute the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and each of the preset reference templates.
In a specific implementation, each of the preset reference templates contains the speech content of a corresponding keyword. The number of preset reference templates can be set according to actual needs, and the present invention places no limit on it.
Step S105: when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is determined to be below a preset threshold, take the keyword of the current reference template as the recognition result.
In a specific implementation, the preset reference templates are traversed one by one. For the current reference template, the median DTW distance, median Euclidean distance, and median cross-correlation distance between the current audio data to be recognized and the template are computed, and the mean of the three medians is compared with the preset threshold. When the mean is determined to be below the threshold, the keyword of the current reference template can be taken as the recognition result; otherwise, it is determined that the current audio data to be recognized does not contain the speech of the keyword in the current reference template.
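The traversal and threshold test of steps S104 and S105 reduce to a simple loop. In this sketch, the mapping from keyword to its three precomputed medians is an assumed representation, not the patent's:

```python
def match_keyword(medians_by_template, threshold):
    """Traverse the reference templates; return the first keyword whose
    mean of the three medians (DTW, Euclidean, cross-correlation
    distance) falls below the threshold, or None if no template matches.
    """
    for keyword, (dtw_med, ed_med, cc_med) in medians_by_template.items():
        if (dtw_med + ed_med + cc_med) / 3.0 < threshold:
            return keyword
    return None

print(match_keyword({"help": (0.2, 0.3, 0.1),
                     "stop": (0.9, 0.8, 0.7)}, 0.5))  # → help
```

Returning `None` corresponds to the case where the audio data contains none of the template keywords.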
The keyword recognition method in the embodiment of the present invention is described in further detail below with reference to Fig. 2.
Fig. 2 shows a flowchart of another keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in Fig. 2 can include the following steps:
Step S201: perform overlapping framing on the obtained audio data, to obtain a plurality of corresponding sound frames.
In a specific implementation, the collected speech signal can first be analog-to-digital converted to obtain the corresponding audio data. The corresponding audio data can then be framed with overlap to obtain a plurality of sound frames. Framing the collected audio data is essentially short-time analysis of the audio data: the speech signal is divided into short time segments of fixed duration, each of which is a relatively stationary sound clip. Adjacent sound frames partially overlap, and the overlap range can be chosen according to the actual situation.
Step S202: apply a window to each of the resulting sound frames.
In a specific implementation, common speech-processing window functions such as the Hamming window, the Hanning window, or the rectangular window can be selected, with a frame length of 10 to 40 ms and a typical value of 20 ms. Framing destroys the naturalness of the speech signal; applying a window to each sound frame alleviates this problem.
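As an illustrative sketch of step S202 (assuming 16 kHz audio, so 20 ms is 320 samples), a Hamming window is applied by elementwise multiplication:

```python
import numpy as np

sr = 16000
frame_len = int(0.020 * sr)        # 20 ms, the cited typical frame length
window = np.hamming(frame_len)     # Hanning or rectangular also possible
frames = np.ones((3, frame_len))   # stand-in for real sound frames
windowed = frames * window         # broadcasts the window over every frame
```

The window tapers each frame toward zero at its edges, which is what reduces the spectral leakage the framing introduces.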
Step S203: perform a fast Fourier transform on the windowed sound frames, to obtain the spectral energy information of each sound frame.
In a specific implementation, audio data is in theory a non-stationary process varying over time and cannot be converted to the frequency domain directly. However, because the audio data has been framed (short-time analysis), each frame of audio data can be considered relatively stationary, so the frequency-domain conversion can be applied.
In a specific implementation, the short-time Fourier transform (Short-Time Fourier Transform / Short-Term Fourier Transform, STFT) can be used to convert each frame of audio data to the frequency domain, to obtain the spectral information of each sound frame. The resulting spectrum captures the relation between the frequency and the energy of the corresponding speech signal.
Step S204: convert the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and compute the corresponding MFCC parameters.
In an embodiment of the present invention, after the spectral energies of the sound frames of the current audio data to be recognized are obtained, it can first be judged whether the spectral energy of the current audio data exceeds a preset energy threshold. When it is determined that the spectral energy of the current audio data exceeds the energy threshold, step S204 continues; otherwise, it is determined that the current audio data contains no keyword speech, and subsequent processing of the current audio data can stop, further saving computing resources.
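The energy gate described above can be sketched as below; the threshold value itself is an assumed free parameter the patent leaves unspecified:

```python
import numpy as np

def passes_energy_gate(frames, energy_threshold):
    """Total spectral energy across all frames; when the utterance is
    too quiet, MFCC computation and template matching are skipped."""
    spectra = np.abs(np.fft.rfft(frames, axis=-1)) ** 2
    return float(spectra.sum()) > energy_threshold

loud = np.ones((4, 512))
quiet = np.zeros((4, 512))
print(passes_energy_gate(loud, 1.0), passes_energy_gate(quiet, 1.0))  # → True False
```

Gating before the mel conversion is what yields the resource saving: the costly feature extraction and distance computations only run on frames worth analyzing.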
In a specific implementation, the spectral energy obtained by the FFT can be converted, according to a preset mapping, to spectral energy on the mel (Mel) frequency scale, and the MFCC parameters of each sound frame are computed as that frame's feature vector.
Step S205: based on the MFCC parameters of each sound frame, compute the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the current sound frame and the current reference template among the preset reference templates.
In an embodiment of the present invention, when the DTW distance between the current audio data to be recognized and a reference template is computed, both are divided into I frames. The inventors have observed from experience that, during the recording of a reference template, the speaker tends to pronounce more excitedly and to speak more slowly than usual. Therefore the reference template is divided into I frames, the hop size for the DTW distance computation is 0.1·I frames, and after the DTW distances between the I frames of the current audio data to be recognized and the I frames of the reference template are computed, the median of the I DTW distances is taken as the median DTW distance between the current audio data and the corresponding reference template. The median Euclidean distance (ED) and the median cross-correlation distance (CC) between the current audio data and the corresponding reference template are obtained in the same way.
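The patent only loosely specifies the 0.1·I hop scheme; the sketch below is one interpretation, computing a classic DTW distance at successive hops through the query and taking the median:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-time-warping distance between two feature
    sequences (rows = per-frame MFCC vectors)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def dtw_median(query, template, hop_fraction=0.1):
    """Median of DTW distances taken at hops of 0.1*I frames,
    where I is the template length (interpretation, see lead-in)."""
    I = len(template)
    hop = max(1, int(hop_fraction * I))
    starts = range(0, max(1, len(query) - I + 1), hop)
    dists = [dtw_distance(query[s:s + I], template) for s in starts]
    return float(np.median(dists))
```

Taking the median rather than a single distance makes the score robust to the slower, more emphatic pronunciation observed during template recording.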
Step S206: judge whether the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is below the preset threshold; when the result is yes, step S207 can be performed; otherwise, execution resumes from step S205 for the next reference template among the preset reference templates.
In a specific implementation, after the median DTW distance, median Euclidean distance, and median cross-correlation distance between the current audio data to be recognized and the reference template are computed, the mean of the three is compared with the preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current audio data to be recognized; that is, different noise levels correspond to different preset thresholds. When p ≥ p1, the audio data to be recognized is determined to have a low noise level, where p denotes the absolute amplitude of the audio data to be recognized and p1 is a preset first threshold; when p1 > p ≥ p2, the audio data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; when p < p2, the audio data to be recognized is determined to have a high noise level. In an embodiment of the present invention, p1 is 0.8 and p2 is 0.45.
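The noise-level classification above maps directly to a three-way comparison, using the thresholds of this embodiment as illustrative defaults:

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify the absolute amplitude p of the audio data into the
    three noise levels of the embodiment (p1 = 0.8, p2 = 0.45)."""
    if p >= p1:
        return "low"
    if p >= p2:
        return "medium"
    return "high"

print(noise_level(0.9), noise_level(0.6), noise_level(0.3))  # → low medium high
```

The selected level would then index into a table of per-level decision thresholds for the mean-of-medians comparison.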
Step S207: take the keyword of the current reference template as the recognition result and output it.
In a specific implementation, when it is determined that the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between some preset reference template and the current audio data to be recognized is below the preset threshold, it can be determined that the current audio data contains the speech of the keyword in that reference template. The keyword of that reference template can therefore be output as the keyword recognition result for the current audio data.
In a specific implementation, when the above keyword recognition method is applied in an alarm system, the alarm system can perform an alarm operation upon recognizing the corresponding keyword.
It should be pointed out here that, in emergency situations or other keyword applications, a simple (i.e. untrained) user can record personalized keywords. To ensure good recognition performance, the reference template becomes extremely important; its recording quality can be ensured by a simple verification operation.
To this end, the inventors propose three detection factors: detecting transient noise sources (such as a door slamming), detecting static noise sources (such as a fan or traffic noise), and verifying that the pronounced content of the keyword is rich. All three factors must be satisfied simultaneously, otherwise the keyword must be re-recorded. For transient noise detection, consecutive 25 ms sound frames with a 5 ms hop can be used, examining the difference in the absolute amplitude of the signal energy; the absolute amplitudes can be averaged over every 5 sound frames. For static noise detection, the keyword is recorded in a quiet environment within a preset 5 s time window; compared with the audio data containing the keyword, the signal energy at the beginning and end of the reference template within the 5 s window, where no keyword is present, differs markedly. For verification of rich pronunciation content, keywords consisting only of a single vowel and containing no consonants are rejected; this rejection can be based on a modified zero-crossing rate related to the pronounced content of the keyword.
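The third factor rests on the zero-crossing rate; a minimal sketch (the "modified" variant the patent mentions is not specified, so this shows only the plain ZCR):

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent samples whose sign differs. A recording of
    a bare vowel tends to show a markedly lower ZCR than a keyword
    containing consonants, so a low ZCR can flag content that is not
    rich enough and should be re-recorded."""
    signs = np.sign(frame)
    signs[signs == 0] = 1          # treat exact zeros as positive
    return float(np.mean(signs[:-1] != signs[1:]))

alternating = np.array([1.0, -1.0] * 100)  # crosses zero at every step
print(zero_crossing_rate(alternating))      # → 1.0
```

In a verification pass, the ZCR of the candidate recording would be compared against a minimum expected for consonant-bearing keywords.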
The device corresponding to the keyword recognition method in the embodiment of the present invention is described in further detail below.
Referring to Fig. 3, the keyword recognition device 300 in an embodiment of the present invention can include a framing unit 301, a frequency-domain conversion unit 302, a first computation unit 303, a second computation unit 304, a judging unit 305, and a keyword recognition unit 306, wherein:
the framing unit 301 is adapted to divide the obtained audio data to be recognized into a plurality of overlapping sound frames;
the frequency-domain conversion unit 302 is adapted to traverse the sound frames obtained by the division and to perform a fast Fourier transform on the speech signal of the current sound frame being traversed, to obtain the corresponding spectral energy;
the first computation unit 303 is adapted to convert the resulting spectral energy to spectral energy on the mel-frequency scale, and to compute the corresponding MFCC parameters;
In a specific implementation, a trigger unit (not shown) can also be provided in the keyword recognition device 300, adapted to trigger the first computation unit 303 to perform the conversion of the resulting spectral energy to mel-frequency spectral energy and the computation of the corresponding MFCC parameters when the spectral energy of the current sound frame being traversed exceeds a preset energy threshold;
the second computation unit 304 is adapted to compute, based on the MFCC parameters of the current sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the current sound frame and each of the preset reference templates;
the judging unit 305 is adapted to judge whether the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the current sound frame and the reference template is below a preset threshold;
In a specific implementation, the preset threshold is associated with the noise level of the current sound frame, wherein: when p ≥ p1, the current sound frame is determined to have a low noise level, p denoting the absolute amplitude of the current sound frame and p1 being a preset first threshold; when p1 > p ≥ p2, the current sound frame is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2; when p < p2, the current sound frame is determined to have a high noise level. In an embodiment of the present invention, p1 equals 0.8 and p2 equals 0.45.
In a specific implementation, the reference template includes information on transient noise, static noise, and the rich speech content of a particular speaker.
the keyword recognition unit 306 is adapted to take the keyword of the current reference template as the recognition result and output it when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the current sound frame and the reference template is determined to be below the preset threshold.
Those of ordinary skill in the art will appreciate that all or part of the steps of the various methods of the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium; the storage medium can include a ROM, a RAM, a magnetic disk, an optical disc, and the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; the scope of protection of the present invention shall therefore be defined by the scope of the claims.

Claims (12)

1. A keyword recognition method, characterised by including:
dividing the obtained audio data to be recognized into a plurality of overlapping sound frames;
performing a fast Fourier transform on the speech signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
converting the spectral energy of each sound frame to spectral energy on the mel-frequency scale, and computing the corresponding MFCC parameters;
based on the MFCC parameters of each sound frame, computing the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be recognized and each of a plurality of preset reference templates;
when the mean of the median DTW distance, median Euclidean distance, and median cross-correlation distance between the audio data to be recognized and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
2. keyword recognition method according to claim 1, it is characterised in that in the sound to be identified When the spectrum energy of data is more than default energy threshold, perform described by the corresponding frequency of each voiced frame Spectrum energy is converted to the spectrum energy under mel-frequency, and calculates the operation of corresponding MFCC parameters.
3. The keyword recognition method according to claim 1, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
4. The keyword recognition method according to claim 3, characterized in that the noise level of the audio data to be identified includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified, and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the audio data to be identified is determined to have a high noise level.
5. The keyword recognition method according to claim 4, characterized in that p1 is equal to 0.8 and p2 is equal to 0.45.
6. The keyword recognition method according to claim 1, characterized in that the reference template includes information on transient noise, static noise, and rich speech content of a specific person.
7. A keyword recognition device, characterized by comprising:
a framing unit, adapted to divide acquired audio data to be identified into a plurality of overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the sound signal of each of the plurality of sound frames obtained by the division, to obtain corresponding spectral energy;
a first calculation unit, adapted to convert the spectral energy corresponding to each sound frame into spectral energy on the mel-frequency scale, and to calculate corresponding MFCC parameters;
a second calculation unit, adapted to calculate, according to the MFCC parameters corresponding to each sound frame, the DTW distance median, the Euclidean distance median, and the cross-correlation distance median between the audio data to be identified and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the DTW distance median, the Euclidean distance median, and the cross-correlation distance median between the current sound frame and a current reference template is less than a preset threshold;
a keyword recognition unit, adapted to take the keyword in the current reference template as the recognition result when it is determined that the mean of the DTW distance median, the Euclidean distance median, and the cross-correlation distance median between the audio data to be identified and the current reference template is less than the preset threshold.
8. The keyword recognition device according to claim 7, characterized by further comprising a triggering unit, the triggering unit being adapted to trigger the first calculation unit to perform the operation of converting the spectral energy corresponding to each sound frame into spectral energy on the mel-frequency scale and calculating the corresponding MFCC parameters when the spectral energy of the audio data to be identified is greater than a preset energy threshold.
9. The keyword recognition device according to claim 7, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
10. The keyword recognition device according to claim 9, characterized in that the noise level of the audio data to be identified includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified, and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the audio data to be identified is determined to have a high noise level.
11. The keyword recognition device according to claim 10, characterized in that p1 is equal to 0.8 and p2 is equal to 0.45.
12. The keyword recognition device according to claim 7, characterized in that the reference template includes information on transient noise, static noise, and rich speech content of a specific person.
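Taken together, method claims 1 to 6 describe a conventional template-matching pipeline. The Python sketch below is a minimal illustration of that pipeline, not the patented implementation: the frame length (400 samples) and hop (160 samples), the mel filter bank, the per-frame distance definitions, and the decision threshold are assumptions introduced here, and a single length-normalised DTW score stands in for the claimed DTW distance median. Only the noise-level thresholds p1 = 0.8 and p2 = 0.45 come from the claims themselves (claims 4 and 5).

```python
# Illustrative sketch of the claimed matching pipeline. Front-end parameters,
# distance definitions, and the threshold are simplifying assumptions.
import numpy as np

def frame_overlapping(x, frame_len=400, hop=160):
    """Step 1: divide the audio to be identified into overlapping frames."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def simple_mfcc(frames, sr=16000, n_mel=20, n_ceps=13):
    """Steps 2-3: FFT -> spectral energy -> mel-scale energy -> DCT (MFCC)."""
    spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1)) ** 2
    freqs = np.fft.rfftfreq(frames.shape[1], 1.0 / sr)
    mel = 2595.0 * np.log10(1.0 + freqs / 700.0)          # Hz -> mel
    centers = np.linspace(mel.min(), mel.max(), n_mel + 2)
    width = centers[1] - centers[0]
    fbank = np.maximum(0.0, 1.0 - np.abs(mel[None, :] - centers[1:-1, None]) / width)
    logmel = np.log(spec @ fbank.T + 1e-10)
    k = np.arange(n_mel)
    dct = np.cos(np.pi * np.arange(n_ceps)[:, None] * (2 * k + 1) / (2.0 * n_mel))
    return logmel @ dct.T

def dtw_distance(a, b):
    """Length-normalised DTW distance between two MFCC sequences."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

def noise_level(p, p1=0.8, p2=0.45):
    """Claims 4-5: classify by the absolute amplitude p, with p1 > p2."""
    if p >= p1:
        return "low"
    if p >= p2:
        return "medium"
    return "high"

def recognize(audio, templates, threshold):
    """Final step of claim 1: for each reference template, average the DTW,
    Euclidean, and cross-correlation distance terms; report the keyword of
    the first template whose average falls below the preset threshold."""
    q = simple_mfcc(frame_overlapping(audio))
    for keyword, ref_audio in templates.items():
        t = simple_mfcc(frame_overlapping(ref_audio))
        k = min(len(q), len(t))                  # truncate to a common length
        dtw_term = dtw_distance(q, t)            # stands in for the DTW median
        eucl_med = np.median(np.linalg.norm(q[:k] - t[:k], axis=1))
        # Cross-correlation turned into a distance: 1 - normalised correlation.
        corr = [np.dot(q[i], t[i]) / (np.linalg.norm(q[i]) * np.linalg.norm(t[i]) + 1e-10)
                for i in range(k)]
        xcorr_med = np.median(1.0 - np.asarray(corr))
        if (dtw_term + eucl_med + xcorr_med) / 3.0 < threshold:
            return keyword
    return None
```

With a reference template identical to the query audio, all three distance terms collapse to (near) zero, so any positive threshold accepts the template's keyword; in practice the threshold would be tuned according to the noise level, as claims 3 to 5 suggest.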
CN201510993729.8A 2015-12-25 2015-12-25 Keyword recognition method and device Active CN106920558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510993729.8A CN106920558B (en) 2015-12-25 2015-12-25 Keyword recognition method and device


Publications (2)

Publication Number Publication Date
CN106920558A true CN106920558A (en) 2017-07-04
CN106920558B CN106920558B (en) 2021-04-13

Family

ID=59454658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510993729.8A Active CN106920558B (en) 2015-12-25 2015-12-25 Keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN106920558B (en)


Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080037837A1 (en) * 2004-05-21 2008-02-14 Yoshihiro Noguchi Behavior Content Classification Device
CN101222703A (en) * 2007-01-12 2008-07-16 杭州波导软件有限公司 Identity verification method for mobile terminal based on voice identification
CN101599269A (en) * 2009-07-02 2009-12-09 中国农业大学 Sound end detecting method and device
CN102509547A * 2011-12-29 2012-06-20 辽宁工业大学 Method and system for voiceprint recognition based on vector quantization
CN102687100A (en) * 2010-01-06 2012-09-19 高通股份有限公司 User interface methods and systems for providing force-sensitive input
CN103021409A (en) * 2012-11-13 2013-04-03 安徽科大讯飞信息科技股份有限公司 Voice activating photographing system
CN103065627A (en) * 2012-12-17 2013-04-24 中南大学 Identification method for horn of special vehicle based on dynamic time warping (DTW) and hidden markov model (HMM) evidence integration
CN103854645A (en) * 2014-03-05 2014-06-11 东南大学 Speech emotion recognition method based on punishment of speaker and independent of speaker
CN103971678A (en) * 2013-01-29 2014-08-06 腾讯科技(深圳)有限公司 Method and device for detecting keywords
CN104103272A (en) * 2014-07-15 2014-10-15 无锡中星微电子有限公司 Voice recognition method and device and blue-tooth earphone
CN104103280A (en) * 2014-07-15 2014-10-15 无锡中星微电子有限公司 Dynamic time warping algorithm based voice activity detection method and device
CN104778951A (en) * 2015-04-07 2015-07-15 华为技术有限公司 Speech enhancement method and device
CN104978507A (en) * 2014-04-14 2015-10-14 中国石油化工集团公司 Intelligent well logging evaluation expert system identity authentication method based on voiceprint recognition


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
ABHIJEET KUMAR: "Voice Command Recognition System Based on MFCC and DTW", INTERNATIONAL JOURNAL OF ENGINEERING SCIENCE AND TECHNOLOGY *
LINDASALWA: ""Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques"", 《JOURNAL OF COMPUTING》 *
LIU ZHIJING: "Automatic Gait Recognition Based on Weighted DTW Distance", JOURNAL OF IMAGE AND GRAPHICS *
WU KANGYAN: "A DTW Score-Following Algorithm Combining Endpoint Detection and Error Detection", COMPUTER APPLICATIONS AND SOFTWARE *
ZHAO XIAOHUI: "Research on Dynamic Fuzzy Clustering of Time Series", CHINA MASTER'S THESES FULL-TEXT DATABASE, INFORMATION SCIENCE AND TECHNOLOGY *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109065043A (en) * 2018-08-21 2018-12-21 广州市保伦电子有限公司 A kind of order word recognition method and computer storage medium
CN109065043B (en) * 2018-08-21 2022-07-05 广州市保伦电子有限公司 Command word recognition method and computer storage medium
CN112765335A (en) * 2021-01-27 2021-05-07 上海三菱电梯有限公司 Voice calling landing system
CN112765335B (en) * 2021-01-27 2024-03-08 上海三菱电梯有限公司 Voice call system


Similar Documents

Publication Publication Date Title
KR101988222B1 (en) Apparatus and method for large vocabulary continuous speech recognition
KR102134201B1 (en) Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition
KR100679051B1 (en) Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms
US8271283B2 (en) Method and apparatus for recognizing speech by measuring confidence levels of respective frames
Mantena et al. Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping
Shahnawazuddin et al. Pitch-Adaptive Front-End Features for Robust Children's ASR.
US20090313016A1 (en) System and Method for Detecting Repeated Patterns in Dialog Systems
US20100161330A1 (en) Speech models generated using competitive training, asymmetric training, and data boosting
Mouaz et al. Speech recognition of moroccan dialect using hidden Markov models
US7177810B2 (en) Method and apparatus for performing prosody-based endpointing of a speech signal
Vyas A Gaussian mixture model based speech recognition system using Matlab
CN104103280B (en) The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
Abdo et al. Automatic detection for some common pronunciation mistakes applied to chosen Quran sounds
Zehetner et al. Wake-up-word spotting for mobile systems
Chadha et al. Optimal feature extraction and selection techniques for speech processing: A review
CN106920558A (en) Keyword recognition method and device
Zolnay et al. Extraction methods of voicing feature for robust speech recognition.
Jung et al. Selecting feature frames for automatic speaker recognition using mutual information
Yavuz et al. A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model.
Li et al. Voice-based recognition system for non-semantics information by language and gender
Sharma et al. Speech recognition of Punjabi numerals using synergic HMM and DTW approach
Ishizuka et al. A feature for voice activity detection derived from speech analysis with the exponential autoregressive model
Rahman et al. Continuous bangla speech segmentation, classification and feature extraction
JP4576612B2 (en) Speech recognition method and speech recognition apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant