CN106920558A - Keyword recognition method and device - Google Patents
Keyword recognition method and device
- Publication number
- CN106920558A CN106920558A CN201510993729.8A CN201510993729A CN106920558A CN 106920558 A CN106920558 A CN 106920558A CN 201510993729 A CN201510993729 A CN 201510993729A CN 106920558 A CN106920558 A CN 106920558A
- Authority
- CN
- China
- Prior art keywords
- to-be-identified
- voice data
- median
- keyword
- preset
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 30
- 238000001228 spectrum Methods 0.000 claims abstract description 50
- 230000009466 transformation Effects 0.000 claims abstract description 9
- 230000003068 static effect Effects 0.000 claims description 8
- 230000001052 transient effect Effects 0.000 claims description 8
- 238000012545 processing Methods 0.000 claims description 6
- 238000006243 chemical reaction Methods 0.000 claims description 4
- 230000001960 triggered effect Effects 0.000 claims description 3
- 230000008859 change Effects 0.000 description 5
- 230000008569 process Effects 0.000 description 5
- 238000009432 framing Methods 0.000 description 3
- 238000001514 detection method Methods 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 238000004891 communication Methods 0.000 description 1
- 230000006870 function Effects 0.000 description 1
- 238000007689 inspection Methods 0.000 description 1
- 230000002045 lasting effect Effects 0.000 description 1
- 238000012549 training Methods 0.000 description 1
- 238000012795 verification Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/04—Training, enrolment or model building
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/16—Hidden Markov models [HMM]
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Telephonic Communication Services (AREA)
Abstract
A keyword recognition method and device. The method includes: dividing acquired to-be-identified voice data into multiple overlapping sound frames; performing a fast Fourier transform on the voice signal of each of the sound frames to obtain the corresponding spectrum energy; converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and computing the corresponding MFCC parameters; according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates; and, when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result. The above scheme improves the accuracy of keyword recognition while saving computing resources.
Description
Technical field
The present invention relates to the technical field of speech recognition, and in particular to a keyword recognition method and device.
Background technology
Speech recognition is the technology by which a machine converts human speech into corresponding text or commands through a process of identification and understanding. As an important branch of the speech-recognition field, keyword recognition (Isolated Word Recognition, IWR) is widely applied in areas such as communications, consumer electronics, self-service, and office automation.
In the prior art, keyword recognition is typically performed with a hidden Markov model (Hidden Markov Model, HMM) and its corresponding parameters, or with a keyword spotting system (KWS). However, such prior-art keyword recognition methods must build a corresponding model and train its parameters through corresponding computation, and therefore suffer from a heavy computational load and low recognition accuracy.
Summary of the invention
The problem solved by embodiments of the present invention is to improve the accuracy of keyword recognition while saving computing resources.
To solve the above problem, an embodiment of the present invention provides a keyword recognition method, the keyword recognition method including:
dividing acquired to-be-identified voice data into multiple overlapping sound frames;
performing a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy;
converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters;
according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates;
when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
Optionally, the converting of each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters are performed when the spectrum energy of the to-be-identified voice data exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the to-be-identified voice data.
Optionally, the noise level of the to-be-identified voice data is one of a low noise level, a medium noise level, and a high noise level, where:
when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold;
when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the to-be-identified voice data is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
An embodiment of the present invention further provides a keyword identification device, the device including:
a framing unit, adapted to divide acquired to-be-identified voice data into multiple overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy;
a first computing unit, adapted to convert each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and compute the corresponding MFCC parameters;
a second computing unit, adapted to compute, according to the MFCC parameters of each sound frame, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates;
a judging unit, adapted to judge whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is below a preset threshold;
a keyword recognition unit, adapted to take the keyword of the current reference template as the recognition result when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below the preset threshold.
Optionally, the device further includes a trigger unit, the trigger unit being adapted to trigger the first computing unit to perform the converting of each sound frame's spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters when the spectrum energy of the to-be-identified voice data exceeds a preset energy threshold.
Optionally, the preset threshold is associated with the noise level of the to-be-identified voice data.
Optionally, the noise level of the to-be-identified voice data is one of a low noise level, a medium noise level, and a high noise level, where:
when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold;
when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2;
when p < p2, the to-be-identified voice data is determined to have a high noise level.
Optionally, p1 equals 0.8 and p2 equals 0.45.
Optionally, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
Compared with the prior art, the technical scheme of the present invention has the following advantages:
In the above scheme, whether a sound frame contains a keyword is decided by comparing, against a preset threshold, the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median computed, from the corresponding MFCC parameters, between the to-be-identified voice data and a reference template. No mathematical recognition model needs to be built and no keyword-specific training is required, so the scheme saves the computing resources of keyword recognition and improves its accuracy.
Further, keyword recognition is performed on to-be-identified voice data only when its spectrum energy exceeds a preset energy threshold; otherwise, no keyword recognition is performed on that data. This further saves computing resources and increases the speed of keyword recognition.
Further, when a reference template is recorded, the reference template includes information on transient noise, static noise, and rich voice content of the particular speaker, so that the template records the speaker's voice and its acoustic environment relatively accurately, further improving the accuracy of keyword recognition.
Brief description of the drawings
Fig. 1 is a flowchart of a keyword recognition method in an embodiment of the present invention;
Fig. 2 is a flowchart of another keyword recognition method in an embodiment of the present invention;
Fig. 3 is a structural diagram of a keyword identification device in an embodiment of the present invention.
Detailed description of the embodiments
To solve the above problems in the prior art, the technical scheme adopted by embodiments of the present invention decides whether a sound frame contains a keyword by comparing, against a preset threshold, the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and a reference template. This saves the computing resources of keyword recognition and improves its accuracy.
To make the above objects, features, and advantages of the present invention clearer and easier to understand, specific embodiments of the invention are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a flowchart of a keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in Fig. 1 may include the following steps:
Step S101: dividing acquired to-be-identified voice data into multiple overlapping sound frames.
In a specific implementation, the size of the overlap between sound frames can be configured according to actual needs. For example, when each sound frame is 32 ms long, the overlap between adjacent sound frames may be 16 ms.
Step S102: performing a fast Fourier transform on the voice signal of each of the sound frames obtained by the dividing, to obtain the corresponding spectrum energy.
In a specific implementation, the sound frames obtained by the dividing are time-domain voice signals; a fast Fourier transform (FFT) converts each time-domain voice signal into a frequency-domain voice signal.
Step S103: converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters.
In a specific implementation, the spectrum energy (power spectrum) of the voice signal obtained by the fast Fourier transform can be converted, according to a preset mapping, to spectrum energy on the mel-frequency scale, and the mel-frequency cepstrum coefficient (Mel Frequency Cepstrum Coefficient, MFCC) parameters of each sound frame are computed from the spectrum energy on the mel-frequency scale.
Step S104: according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and each of multiple preset reference templates.
In a specific implementation, each of the preset reference templates contains the voice content of a corresponding keyword. The number of preset reference templates can be configured according to actual needs, and the present invention places no limit on it.
Step S105: when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is determined to be below a preset threshold, taking the keyword of the current reference template as the recognition result.
In a specific implementation, the preset reference templates are traversed. For each template, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the current reference template are computed, and the mean of the three is compared with the preset threshold. When the mean is determined to be below the preset threshold, the keyword of the current reference template can be taken as the recognition result; otherwise, the current to-be-identified voice data is determined not to contain the voice information of the keyword of the current reference template.
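The traversal-and-threshold decision of steps S104-S105 can be sketched as follows. The template structure (a dict with `keyword` and `feats` keys) and the injected `dist_medians` function are illustrative assumptions; the patent only fixes the rule that the mean of the three medians must fall below the preset threshold.

```python
def recognize(audio_feats, templates, dist_medians, threshold):
    """Return the keyword of the first reference template whose mean of the
    three distance medians (DTW, Euclidean, cross-correlation) is below the
    preset threshold, or None when no template matches."""
    for tpl in templates:
        dtw_m, ed_m, cc_m = dist_medians(audio_feats, tpl["feats"])
        if (dtw_m + ed_m + cc_m) / 3.0 < threshold:
            return tpl["keyword"]
    return None  # no keyword recognised in this clip
```

Whether to stop at the first match or pick the template with the smallest mean is a design choice the text leaves open; the sketch stops at the first match, mirroring the traversal described above.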
The keyword recognition method in the embodiment of the present invention is described in further detail below with reference to Fig. 2.
Fig. 2 shows a flowchart of another keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in Fig. 2 may include the following steps:
Step S201: dividing acquired voice data into overlapping frames to obtain multiple sound frames.
In a specific implementation, the collected voice signal may first be analog-to-digital converted to obtain the corresponding voice data. The voice data is then divided into overlapping frames to obtain multiple sound frames. Framing the collected voice data is essentially short-time analysis of the voice data: the voice signal is divided into short segments of fixed duration, each of which is a relatively stationary sound clip. Adjacent sound frames partially overlap, and the overlap can be chosen according to the actual situation.
Step S202: applying a window to each of the resulting sound frames.
In a specific implementation, window functions commonly used in speech processing, such as the Hamming window, Hanning window, or rectangular window, can be selected, with a frame length of 10-40 ms and a typical value of 20 ms. Framing destroys the naturalness of the voice signal; applying windowing and overlap to the sound frames mitigates this problem.
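The windowing of step S202 amounts to an element-wise multiply of each frame by one of the window functions named above; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def window_frames(frames, kind="hamming"):
    """Multiply every frame by an analysis window (Hamming, Hanning, or
    rectangular) to soften the frame edges before the FFT."""
    n = frames.shape[1]
    win = {"hamming": np.hamming, "hanning": np.hanning, "rect": np.ones}[kind](n)
    return frames * win
```

The Hamming window, for instance, tapers each frame from 0.08 at the edges to 1.0 at the centre, reducing the spectral leakage that abrupt frame boundaries would otherwise cause.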
Step S203: performing a fast Fourier transform on each windowed sound frame to obtain the information of the corresponding spectrum energy of each sound frame.
In a specific implementation, voice data is in theory a non-stationary process that changes over time, so it cannot be converted to the frequency domain directly. However, because the voice data has been framed (short-time analysis), each frame of voice data can be regarded as quasi-stationary and is therefore suitable for frequency-domain conversion.
In a specific implementation, a short-time Fourier transform (Short-Time Fourier Transform / Short-Term Fourier Transform, STFT) can be used to convert each frame of voice data to the frequency domain, obtaining the corresponding spectrum information of each sound frame. The resulting spectrum describes the relation between the frequency and the energy of the corresponding voice signal.
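The per-frame spectrum energy of step S203 can be sketched as the squared magnitude of each frame's real FFT; normalising by the frame length is one common convention (an assumption here, since the patent does not fix a normalisation):

```python
import numpy as np

def spectrum_energy(frames):
    """Per-frame power spectrum: real FFT of each (windowed) time-domain
    frame, magnitude squared, normalised by the frame length."""
    spec = np.fft.rfft(frames, axis=1)
    return (spec.real ** 2 + spec.imag ** 2) / frames.shape[1]
```

For a frame of N samples the real FFT yields N//2 + 1 frequency bins, so each sound frame is mapped to one row of bin energies.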
Step S204: converting each sound frame's spectrum energy to spectrum energy on the mel-frequency scale, and computing the corresponding MFCC parameters.
In an embodiment of the present invention, after the spectrum energies of the multiple sound frames of the current to-be-identified voice data are obtained, it may first be judged whether the spectrum energy of the current to-be-identified voice data exceeds a preset energy threshold. When the spectrum energy of the current to-be-identified voice data is determined to exceed the energy threshold, step S204 continues; otherwise, the current to-be-identified voice data is determined not to contain the voice information of a keyword, and subsequent processing of it can stop, further saving computing resources.
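The energy gate just described reduces to a single comparison; a minimal sketch, assuming the gate sums the per-frame energies (the patent does not say exactly how the clip-level energy is aggregated):

```python
def should_process(frame_energies, energy_threshold):
    """Energy gate from step S204: run MFCC extraction and template matching
    only when the clip's total spectral energy exceeds the preset threshold.
    The threshold value itself is application-dependent."""
    return float(sum(frame_energies)) > energy_threshold
```

Clips failing the gate are treated as containing no keyword and skipped outright, which is where the computing-resource saving comes from.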
In a specific implementation, the spectrum energy obtained by the FFT can be converted, according to a preset mapping, to spectrum energy on the mel (Mel) frequency scale, and the MFCC parameters of each sound frame are computed as that sound frame's feature vector.
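The mel conversion and MFCC computation can be sketched with a standard triangular mel filterbank, a log, and a type-II DCT. The filter count, coefficient count, and floor constant below are illustrative defaults, not values from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spec, sample_rate, n_filters=20, n_coeffs=12):
    """Map each frame's power spectrum onto a triangular mel filterbank,
    then take the log and a DCT to get one MFCC vector per frame."""
    n_bins = power_spec.shape[1]
    mel_pts = np.linspace(0, hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_bins - 1) * mel_to_hz(mel_pts) / (sample_rate / 2)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):          # build triangular filters
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    mel_energy = np.log(power_spec @ fbank.T + 1e-10)
    # type-II DCT of the log filterbank energies, keeping the first n_coeffs
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_coeffs), 2 * n + 1) / (2 * n_filters))
    return mel_energy @ dct.T
```

Each row of the result is the feature vector of one sound frame, which is what the distance computations of step S205 consume.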
Step S205: according to the MFCC parameters of each sound frame, computing the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the current one of the preset reference templates.
In an embodiment of the present invention, when computing the DTW distance between the current to-be-identified voice data and a reference template, both are divided into I frames. The inventors observed from experience that, during the recording of a reference template, the speaker's pronunciation tends to become excited and the speaking rate is slower than usual. The reference template is therefore divided into I frames, and the DTW distance is computed with a hop size of 0.1·I frames. After the DTW distances between the I frames of the current to-be-identified voice data and the I frames of the reference template are computed, the median of the I DTW distances is taken as the DTW distance median between the current to-be-identified voice data and the corresponding reference template. Similarly, the Euclidean distance (ED) median and the cross-correlation distance (CC) median between the current to-be-identified voice data and the corresponding reference template can be obtained.
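One way to realise the three distance medians of step S205 is sketched below. This is a simplification under stated assumptions: `dtw_distance` is classic dynamic-programming DTW over MFCC vectors, the 0.1·I hop is applied as a sliding start offset, and cross-correlation distance is taken as one minus the normalised dot product; the patent's exact windowing is not fully specified.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic-programming DTW between two feature sequences
    (frames x coeffs); returns the accumulated warping-path cost."""
    I, J = len(a), len(b)
    D = np.full((I + 1, J + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(1, J + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[I, J]

def distance_medians(a, b):
    """Medians of DTW, Euclidean, and cross-correlation distances between
    to-be-identified features a and template features b (sketch only)."""
    I = min(len(a), len(b))
    ed = [float(np.linalg.norm(a[i] - b[i])) for i in range(I)]
    cc = [1.0 - float(np.dot(a[i], b[i]) /
          (np.linalg.norm(a[i]) * np.linalg.norm(b[i]) + 1e-10)) for i in range(I)]
    hop = max(1, int(0.1 * I))          # hop of 0.1*I frames, as in the text
    dtw = [dtw_distance(a[s:s + I], b) for s in range(0, I, hop)]
    return float(np.median(dtw)), float(np.median(ed)), float(np.median(cc))
```

Medians rather than means are used so that a few badly-matching frames (e.g. a transient) do not dominate any one distance.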
Step S206: judging whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the to-be-identified voice data and the current reference template is below the preset threshold. When the judgment result is yes, step S207 can be performed; otherwise, the procedure repeats from step S205 with the next of the preset reference templates.
In a specific implementation, after the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current to-be-identified voice data and the reference template are computed, the mean of the three is compared with the preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current to-be-identified voice data; that is, different noise levels correspond to different preset thresholds. Specifically, when p ≥ p1, the to-be-identified voice data is determined to have a low noise level, p being the absolute amplitude of the to-be-identified voice data and p1 a preset first threshold; when p1 > p ≥ p2, the to-be-identified voice data is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2; when p < p2, the to-be-identified voice data is determined to have a high noise level. In an embodiment of the present invention, p1 is 0.8 and p2 is 0.45.
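The noise-level classification and its threshold lookup can be sketched as below. The p1/p2 boundaries come from the text; the per-level decision thresholds in `threshold_for` are illustrative placeholders, since the patent states only that the threshold depends on noise level without giving values.

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify a clip by its absolute-amplitude statistic p (p1 > p2):
    a strong, clean signal (large p) is treated as low-noise."""
    if p >= p1:
        return "low"
    if p >= p2:          # i.e. p1 > p >= p2
        return "medium"
    return "high"

def threshold_for(level):
    """Illustrative per-level decision thresholds (assumed values)."""
    return {"low": 0.30, "medium": 0.45, "high": 0.60}[level]
```

A noisier clip thus gets a looser decision threshold, keeping the match rate usable as conditions degrade.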
Step S207: taking the keyword of the current reference template as the recognition result and outputting it.
In a specific implementation, when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between some one of the preset reference templates and the current to-be-identified voice data is determined to be below the preset threshold, it can be determined that the current to-be-identified voice data contains the voice information of that template's keyword. The keyword of that reference template can therefore be output as the keyword recognition result of the current to-be-identified voice data.
In a specific implementation, when the above keyword recognition method is applied in an alarm system, the alarm system can perform an alarm operation upon recognizing a corresponding keyword.
It should be pointed out that, in emergency or other keyword applications, an untrained user may record personalized keywords. To guarantee good recognition performance, the reference template becomes extremely important, and the recording quality of the reference template can be ensured by a simple verification operation.
The inventors therefore propose three detection factors: a transient noise source (such as a door slam), a static noise source (such as a fan or traffic noise), and rich keyword pronunciation content. All three factors must be satisfied simultaneously; otherwise, the keyword must be recorded again. For the detection of transient noise, the difference in absolute amplitude of the energy of the voice signal can be used over consecutive 25 ms sound frames with a hop size of 5 ms, where the absolute amplitudes of every 5 sound frames can be averaged. For static-noise detection, the keyword is recorded in a quiet environment within a preset 5 s time window; compared with the voice data containing the keyword, the signal energy at the beginning and end of the reference template within the 5 s window, which does not contain the keyword, differs markedly. For verifying rich pronunciation content, keywords consisting only of a single vowel without consonants (such as "ah") are rejected; this rejection can be made on the basis of a modified zero-crossing rate related to the keyword's pronunciation content.
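Two of the three template-quality checks can be sketched as follows. The 25 ms / 5 ms frame and hop sizes follow the text; the `jump_ratio` tuning value is an assumption, and the plain zero-crossing rate below is a simplification of the patent's unspecified "modified" zero-crossing rate.

```python
import numpy as np

def has_transient(samples, sample_rate, jump_ratio=5.0):
    """Transient check: short-frame energies (25 ms frames, 5 ms hop); a
    frame whose energy jumps far above the clip average suggests a
    transient such as a door slam. jump_ratio is an assumed tuning value."""
    frame, hop = int(0.025 * sample_rate), int(0.005 * sample_rate)
    energies = [float(np.mean(samples[i:i + frame] ** 2))
                for i in range(0, len(samples) - frame, hop)]
    return bool(max(energies) / (np.mean(energies) + 1e-12) > jump_ratio)

def zero_crossing_rate(samples):
    """Plain zero-crossing rate, used here as a stand-in for the
    content-richness check (single-vowel keywords score low)."""
    return float(np.mean(np.abs(np.diff(np.sign(samples))) > 0))

t = np.linspace(0, 1, 8000, endpoint=False)
tone = np.sin(2 * np.pi * 100 * t + 0.1)      # steady 100 Hz tone
click = tone.copy()
click[4000:4040] += 20.0                      # inject a short loud click
```

A recording that trips the transient check, shows inconsistent start/end energy, or scores too low on the content check would be rejected and re-recorded, as the text requires.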
The device corresponding to the keyword recognition method in the embodiment of the present invention is described in further detail below.
Referring to Fig. 3, the keyword identification device 300 in the embodiment of the present invention can include a framing unit 301, a frequency-domain conversion unit 302, a first computing unit 303, a second computing unit 304, a judging unit 305, and a keyword recognition unit 306, wherein:
the framing unit 301 is adapted to divide acquired to-be-identified voice data into multiple overlapping sound frames;
the frequency-domain conversion unit 302 is adapted to traverse the multiple sound frames obtained by the dividing and perform a fast Fourier transform on the voice signal of the current sound frame being traversed, to obtain the corresponding spectrum energy;
the first computing unit 303 is adapted to convert the resulting spectrum energy to spectrum energy on the mel-frequency scale, and compute the corresponding MFCC parameters;
in a specific implementation, a trigger unit (not shown) can also be provided in the keyword identification device 300, the trigger unit being adapted to trigger the first computing unit 303 to perform the converting of the resulting spectrum energy to spectrum energy on the mel-frequency scale and the computing of the corresponding MFCC parameters when the spectrum energy of the current sound frame being traversed exceeds a preset energy threshold;
the second computing unit 304 is adapted to compute, according to the MFCC parameters of the current sound frame, the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and each of the preset reference templates;
the judging unit 305 is adapted to judge whether the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and the reference template is below a preset threshold;
in a specific implementation, the preset threshold is associated with the noise level of the current sound frame, where: when p ≥ p1, the current sound frame is determined to have a low noise level, p being the absolute amplitude of the current sound frame and p1 a preset first threshold; when p1 > p ≥ p2, the current sound frame is determined to have a medium noise level, p2 being a preset second threshold with p1 > p2; when p < p2, the current sound frame is determined to have a high noise level. In an embodiment of the present invention, p1 equals 0.8 and p2 equals 0.45.
In a specific implementation, the reference template includes information on transient noise, static noise, and rich voice content of a particular speaker.
The keyword recognition unit 306 is adapted to take the keyword of the current reference template as the recognition result and output it when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the current sound frame and the reference template is determined to be below the preset threshold.
One of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments can be completed by a program instructing the relevant hardware, and that the program can be stored in a computer-readable storage medium; the storage medium can include a ROM, a RAM, a magnetic disk, an optical disc, and the like.
The method and system of the embodiments of the present invention have been described in detail above, but the present invention is not limited thereto. Any person skilled in the art can make various changes and modifications without departing from the spirit and scope of the present invention; the protection scope of the present invention should therefore be defined by the scope of the claims.
Claims (12)
1. A keyword recognition method, characterized in that it comprises:
dividing acquired audio data to be identified into a plurality of overlapping sound frames;
performing a fast Fourier transform on the signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale, and computing the corresponding MFCC parameters;
computing, from the MFCC parameters of each sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and each of a plurality of preset reference templates; and
when the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and a current reference template is determined to be less than a preset threshold, taking the keyword of the current reference template as the recognition result.
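The first two steps of claim 1 (overlapping framing and FFT-based spectral energy) can be sketched as follows; the frame length, hop size, and window are illustrative assumptions, since the patent does not specify them:

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=160):
    """Split the audio into overlapping frames (25 ms frames with a
    10 ms hop at 16 kHz; sizes are illustrative, not from the patent)."""
    n_frames = 1 + max(0, (len(audio) - frame_len) // hop)
    return np.stack([audio[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def spectral_energy(frames):
    """Per-frame spectral energy via FFT: sum of |X(k)|^2 over bins."""
    spectrum = np.fft.rfft(frames * np.hamming(frames.shape[1]), axis=1)
    return np.sum(np.abs(spectrum) ** 2, axis=1)

audio = np.random.default_rng(0).standard_normal(16000)  # 1 s at 16 kHz
frames = frame_signal(audio)
energies = spectral_energy(frames)
```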
2. The keyword recognition method according to claim 1, characterized in that, when the spectral energy of the audio data to be identified is greater than a preset energy threshold, the operation of converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale and computing the corresponding MFCC parameters is performed.
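The energy gate of claim 2 might look like the following sketch; the threshold value is a placeholder, since the patent leaves the preset energy threshold unspecified, and taking the maximum per-frame energy as "the" spectral energy is an assumption:

```python
import numpy as np

ENERGY_THRESHOLD = 1000.0  # hypothetical preset value, not from the patent

def passes_energy_gate(spectral_energies, threshold=ENERGY_THRESHOLD):
    """Proceed to the Mel conversion / MFCC stage only when the spectral
    energy of the audio to be identified exceeds the preset threshold."""
    return bool(np.max(spectral_energies) > threshold)
```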
3. The keyword recognition method according to claim 1, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
4. The keyword recognition method according to claim 3, characterized in that the noise level of the audio data to be identified comprises a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; and
when p < p2, the audio data to be identified is determined to have a high noise level.
5. The keyword recognition method according to claim 4, characterized in that p1 equals 0.8 and p2 equals 0.45.
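The three-way amplitude test of claims 4 and 5 can be written directly; note that the medium band is read here as p1 > p ≥ p2, which is the only ordering consistent with the stated constraint p1 > p2:

```python
def noise_level(p, p1=0.8, p2=0.45):
    """Classify the noise level of the audio to be identified from its
    absolute amplitude p, using the claim-5 thresholds p1=0.8, p2=0.45."""
    if p >= p1:
        return "low"
    if p >= p2:  # i.e. p1 > p >= p2
        return "medium"
    return "high"
```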
6. The keyword recognition method according to claim 1, characterized in that the reference templates include transient noise, static noise, and information-rich speech content from a particular speaker.
7. A keyword recognition apparatus, characterized in that it comprises:
a framing unit, adapted to divide acquired audio data to be identified into a plurality of overlapping sound frames;
a frequency-domain conversion unit, adapted to perform a fast Fourier transform on the signal of each of the sound frames obtained by the division, to obtain the corresponding spectral energy;
a first computing unit, adapted to convert the spectral energy of each sound frame into spectral energy on the Mel-frequency scale, and to compute the corresponding MFCC parameters;
a second computing unit, adapted to compute, from the MFCC parameters of each sound frame, the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and each of a plurality of preset reference templates;
a judging unit, adapted to judge whether the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and a current reference template is less than a preset threshold; and
a keyword recognition unit, adapted to take the keyword of the current reference template as the recognition result when the mean of the median DTW distance, the median Euclidean distance, and the median cross-correlation distance between the audio data to be identified and the current reference template is determined to be less than the preset threshold.
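The second computing unit measures, among other things, a DTW distance from the input to each reference template. A plain dynamic-programming sketch over MFCC frame sequences (rows = frames) is shown below; this is the textbook algorithm, not necessarily the patent's implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two MFCC frame sequences a and b.
    O(len(a) * len(b)) cost matrix with steps (i-1,j), (i,j-1), (i-1,j-1);
    per-cell cost is the Euclidean distance between frame vectors."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])
```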
8. The keyword recognition apparatus according to claim 7, characterized by further comprising a trigger unit, the trigger unit being adapted to trigger, when the spectral energy of the audio data to be identified is greater than a preset energy threshold, the first computing unit to perform the operation of converting the spectral energy of each sound frame into spectral energy on the Mel-frequency scale and computing the corresponding MFCC parameters.
9. The keyword recognition apparatus according to claim 7, characterized in that the preset threshold is associated with the noise level of the audio data to be identified.
10. The keyword recognition apparatus according to claim 9, characterized in that the noise level of the audio data to be identified comprises a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, the audio data to be identified is determined to have a low noise level, where p denotes the absolute amplitude corresponding to the audio data to be identified and p1 is a preset first threshold;
when p1 > p ≥ p2, the audio data to be identified is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; and
when p < p2, the audio data to be identified is determined to have a high noise level.
11. The keyword recognition apparatus according to claim 10, characterized in that p1 equals 0.8 and p2 equals 0.45.
12. The keyword recognition apparatus according to claim 7, characterized in that the reference templates include transient noise, static noise, and information-rich speech content from a particular speaker.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510993729.8A CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106920558A true CN106920558A (en) | 2017-07-04 |
CN106920558B CN106920558B (en) | 2021-04-13 |
Family
ID=59454658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510993729.8A Active CN106920558B (en) | 2015-12-25 | 2015-12-25 | Keyword recognition method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106920558B (en) |
Citations (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080037837A1 (en) * | 2004-05-21 | 2008-02-14 | Yoshihiro Noguchi | Behavior Content Classification Device |
CN101222703A (en) * | 2007-01-12 | 2008-07-16 | 杭州波导软件有限公司 | Identity verification method for mobile terminal based on voice identification |
CN101599269A (en) * | 2009-07-02 | 2009-12-09 | 中国农业大学 | Sound end detecting method and device |
CN102509547A (en) * | 2011-12-29 | 2012-06-20 | 辽宁工业大学 | Method and system for voiceprint recognition based on vector quantization based |
CN102687100A (en) * | 2010-01-06 | 2012-09-19 | 高通股份有限公司 | User interface methods and systems for providing force-sensitive input |
CN103021409A (en) * | 2012-11-13 | 2013-04-03 | 安徽科大讯飞信息科技股份有限公司 | Voice activating photographing system |
CN103065627A (en) * | 2012-12-17 | 2013-04-24 | 中南大学 | Identification method for horn of special vehicle based on dynamic time warping (DTW) and hidden markov model (HMM) evidence integration |
CN103854645A (en) * | 2014-03-05 | 2014-06-11 | 东南大学 | Speech emotion recognition method based on punishment of speaker and independent of speaker |
CN103971678A (en) * | 2013-01-29 | 2014-08-06 | 腾讯科技(深圳)有限公司 | Method and device for detecting keywords |
CN104103272A (en) * | 2014-07-15 | 2014-10-15 | 无锡中星微电子有限公司 | Voice recognition method and device and blue-tooth earphone |
CN104103280A (en) * | 2014-07-15 | 2014-10-15 | 无锡中星微电子有限公司 | Dynamic time warping algorithm based voice activity detection method and device |
CN104778951A (en) * | 2015-04-07 | 2015-07-15 | 华为技术有限公司 | Speech enhancement method and device |
CN104978507A (en) * | 2014-04-14 | 2015-10-14 | 中国石油化工集团公司 | Intelligent well logging evaluation expert system identity authentication method based on voiceprint recognition |
Non-Patent Citations (5)
Title |
---|
ABHIJEET KUMAR: "Voice Command Recognition system based on MFCC and DTW", International Journal of Engineering Science and Technology *
LINDASALWA: "Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques", Journal of Computing *
刘志镜: "Automatic Gait Recognition Based on Weighted DTW Distance", Journal of Image and Graphics *
吴康妍: "A DTW Score-Following Algorithm Combining Endpoint Detection and Error Detection", Computer Applications and Software *
赵晓慧: "Research on Dynamic Fuzzy Clustering of Time Series", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109065043A (en) * | 2018-08-21 | 2018-12-21 | 广州市保伦电子有限公司 | A kind of order word recognition method and computer storage medium |
CN109065043B (en) * | 2018-08-21 | 2022-07-05 | 广州市保伦电子有限公司 | Command word recognition method and computer storage medium |
CN112765335A (en) * | 2021-01-27 | 2021-05-07 | 上海三菱电梯有限公司 | Voice calling landing system |
CN112765335B (en) * | 2021-01-27 | 2024-03-08 | 上海三菱电梯有限公司 | Voice call system |
Also Published As
Publication number | Publication date |
---|---|
CN106920558B (en) | 2021-04-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101988222B1 (en) | Apparatus and method for large vocabulary continuous speech recognition | |
KR102134201B1 (en) | Method, apparatus, and storage medium for constructing speech decoding network in numeric speech recognition | |
KR100679051B1 (en) | Apparatus and method for speech recognition using a plurality of confidence score estimation algorithms | |
US8271283B2 (en) | Method and apparatus for recognizing speech by measuring confidence levels of respective frames | |
Mantena et al. | Query-by-example spoken term detection using frequency domain linear prediction and non-segmental dynamic time warping | |
Shahnawazuddin et al. | Pitch-Adaptive Front-End Features for Robust Children's ASR. | |
US20090313016A1 (en) | System and Method for Detecting Repeated Patterns in Dialog Systems | |
US20100161330A1 (en) | Speech models generated using competitive training, asymmetric training, and data boosting | |
Mouaz et al. | Speech recognition of moroccan dialect using hidden Markov models | |
US7177810B2 (en) | Method and apparatus for performing prosody-based endpointing of a speech signal | |
Vyas | A Gaussian mixture model based speech recognition system using Matlab | |
CN104103280B (en) | The method and apparatus of the offline speech terminals detection based on dynamic time consolidation algorithm | |
CN108091340B (en) | Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium | |
Abdo et al. | Automatic detection for some common pronunciation mistakes applied to chosen Quran sounds | |
Zehetner et al. | Wake-up-word spotting for mobile systems | |
Chadha et al. | Optimal feature extraction and selection techniques for speech processing: A review | |
CN106920558A (en) | Keyword recognition method and device | |
Zolnay et al. | Extraction methods of voicing feature for robust speech recognition. | |
Jung et al. | Selecting feature frames for automatic speaker recognition using mutual information | |
Yavuz et al. | A Phoneme-Based Approach for Eliminating Out-of-vocabulary Problem Turkish Speech Recognition Using Hidden Markov Model. | |
Li et al. | Voice-based recognition system for non-semantics information by language and gender | |
Sharma et al. | Speech recognition of Punjabi numerals using synergic HMM and DTW approach | |
Ishizuka et al. | A feature for voice activity detection derived from speech analysis with the exponential autoregressive model | |
Rahman et al. | Continuous bangla speech segmentation, classification and feature extraction | |
JP4576612B2 (en) | Speech recognition method and speech recognition apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||