CN106920558B - Keyword recognition method and device - Google Patents

Keyword recognition method and device

Info

Publication number
CN106920558B
CN106920558B (application number CN201510993729.8A)
Authority
CN
China
Prior art keywords
voice data
median
recognized
template
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510993729.8A
Other languages
Chinese (zh)
Other versions
CN106920558A (en)
Inventor
孙廷玮 (Sun Tingwei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Spreadtrum Communications Shanghai Co Ltd
Original Assignee
Spreadtrum Communications Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Spreadtrum Communications Shanghai Co Ltd filed Critical Spreadtrum Communications Shanghai Co Ltd
Priority to CN201510993729.8A priority Critical patent/CN106920558B/en
Publication of CN106920558A publication Critical patent/CN106920558A/en
Application granted granted Critical
Publication of CN106920558B publication Critical patent/CN106920558B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/04: Training, enrolment or model building
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification
    • G10L17/16: Hidden Markov models [HMMs]

Abstract

A keyword recognition method and device. The method comprises the following steps: dividing acquired voice data to be recognized into a plurality of overlapping voice frames; performing a fast Fourier transform on the signal of each of the divided voice frames to obtain the corresponding spectral energy; converting the spectral energy of each voice frame into spectral energy at the Mel frequency scale, and calculating the corresponding MFCC parameters; calculating, from the MFCC parameters of each voice frame, a DTW distance median, a Euclidean distance median, and a cross-correlation distance median between the voice data to be recognized and each of a plurality of preset reference templates; and, when the mean of the DTW distance median, Euclidean distance median, and cross-correlation distance median between the voice data to be recognized and the current reference template is determined to be smaller than a preset threshold, taking the keyword in the current reference template as the recognition result. The scheme improves the accuracy of keyword recognition and saves computing resources.

Description

Keyword recognition method and device
Technical Field
The invention relates to the technical field of voice recognition, in particular to a keyword recognition method and device.
Background
Speech recognition is a technique in which a machine converts human speech into corresponding text or instructions through a process of recognition and understanding. As an important branch of the speech recognition field, keyword recognition, also known as isolated word recognition (IWR), is widely used in the fields of communication, consumer electronics, self-service, office automation, and the like.
In the prior art, keyword recognition is generally performed with Hidden Markov Models (HMMs) and their corresponding parameters, or with a keyword spotting (KWS) system.
However, the keyword recognition methods in the prior art need to establish a corresponding model and train its parameters through corresponding transcription operations, and therefore suffer from a large amount of computation and a low recognition accuracy.
Disclosure of Invention
The embodiments of the invention address the problem of how to improve the accuracy of keyword recognition while saving computing resources.
In order to solve the above problem, an embodiment of the present invention provides a keyword recognition method, where the keyword recognition method includes:
dividing the acquired voice data to be identified into a plurality of overlapped voice frames;
respectively carrying out fast Fourier transform operation on the sound signals of the plurality of divided sound frames to obtain corresponding frequency spectrum energy;
converting the spectrum energy corresponding to each sound frame into the spectrum energy under the Mel frequency, and calculating the corresponding MFCC parameters;
respectively calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be identified and a plurality of preset reference templates according to MFCC parameters corresponding to each voice frame;
and when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold value, taking the key words in the current reference template as recognition results.
Optionally, when the spectral energy of the sound data to be identified is greater than a preset energy threshold, the operation of converting the spectral energy corresponding to each sound frame into the spectral energy at the mel frequency and calculating the corresponding MFCC parameter is performed.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized includes a low noise level, a medium noise level and a high noise level, wherein:
when p is larger than or equal to p1, determining that the voice data to be identified has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be identified, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be identified has a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the sound data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
The embodiment of the invention also provides a keyword recognition device, which comprises:
the framing processing unit is suitable for dividing the acquired sound data to be identified into a plurality of overlapped sound frames;
the frequency domain conversion unit is suitable for respectively carrying out fast Fourier transform operation on the sound signals of the plurality of divided sound frames to obtain corresponding frequency spectrum energy;
the first calculation unit is suitable for converting the spectrum energy corresponding to each sound frame into the spectrum energy under the Mel frequency and calculating the corresponding MFCC parameters;
the second calculation unit is suitable for respectively calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be identified and a plurality of preset reference templates according to the MFCC parameters corresponding to each voice frame;
the judging unit is suitable for judging whether the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the current sound frame and the current reference template is smaller than a preset threshold value or not;
and the keyword identification unit is suitable for taking the keywords in the current reference template as identification results when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be identified and the current reference template is smaller than a preset threshold value.
Optionally, the keyword recognition device further includes a triggering unit, and the triggering unit is adapted to trigger the first calculating unit to perform the operation of converting the spectral energy corresponding to each voice frame into the spectral energy at the Mel frequency and calculating the corresponding MFCC parameters when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
Optionally, the preset threshold is associated with a noise level of the voice data to be recognized.
Optionally, the noise level of the voice data to be recognized includes a low noise level, a medium noise level and a high noise level, wherein:
when p is larger than or equal to p1, determining that the voice data to be identified has a low noise level, wherein p represents the absolute amplitude corresponding to the voice data to be identified, and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be identified has a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, the sound data to be recognized is determined to have a high noise level.
Optionally, p1 is equal to 0.8 and p2 is equal to 0.45.
Optionally, the reference template includes information of transient noise, static noise, and rich speech content of a specific person.
Compared with the prior art, the technical scheme of the invention has the following advantages:
according to the scheme, whether the sound frame comprises the keywords is determined by comparing the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the sound data to be recognized and the reference template which are calculated based on the corresponding MFCC parameters with the preset threshold value, and a corresponding mathematical recognition model does not need to be established, and corresponding translation of the keywords does not need to be carried out, so that the calculation resources for keyword recognition can be saved, and the accuracy of keyword recognition can be improved.
Further, when the frequency spectrum energy of the voice data to be recognized is greater than the preset energy threshold, the corresponding voice data to be recognized is subjected to keyword recognition, otherwise, the voice data to be recognized is not subjected to keyword recognition, so that the computing resources can be further saved, and the speed of keyword recognition is increased.
Further, when recording the corresponding reference template, the reference template includes transient noise, static noise and rich voice content information of the specific person, so that the reference template can be accurately recorded with the voice of the corresponding specific person and the environment to which the voice belongs, and therefore, the accuracy of keyword recognition can be further improved.
Drawings
FIG. 1 is a flow chart of a keyword recognition method in an embodiment of the present invention;
FIG. 2 is a flow chart of another keyword recognition method in an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a keyword recognition apparatus in an embodiment of the present invention.
Detailed Description
In order to solve the above problems in the prior art, in the technical scheme adopted by the embodiment of the invention, whether the sound frame includes the keyword is determined by comparing the mean value of the DTW distance median, the euclidean distance median and the cross-correlation distance median between the sound data to be recognized and the reference template with the preset threshold value, so that the calculation resources for keyword recognition can be saved, and the accuracy of keyword recognition can be improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 shows a flowchart of a keyword recognition method in an embodiment of the present invention. The keyword recognition method shown in fig. 1 may include the following steps:
step S101: and dividing the acquired voice data to be identified into a plurality of overlapped voice frames.
In a specific implementation, the size of the overlapping portion between the voice frames can be set according to actual needs. For example, when the length of each voice frame is 32ms, the size of the overlapping portion between adjacent voice frames may be 16 ms.
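The overlapped framing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the helper name is hypothetical, and the frame and hop lengths in samples assume a 16 kHz sample rate, under which the 32 ms frame / 16 ms overlap example above becomes 512 and 256 samples.

```python
# Split a signal into overlapping frames (hypothetical helper, not from the
# patent). At an assumed 16 kHz sample rate, a 32 ms frame is 512 samples,
# and a 16 ms overlap corresponds to a hop of 256 samples.
def split_into_frames(samples, frame_len=512, hop=256):
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop  # adjacent frames overlap by frame_len - hop samples
    return frames

signal = list(range(2048))  # stand-in for 128 ms of audio samples
frames = split_into_frames(signal)
```

With these parameters each frame shares its second half with the start of the next frame, which is the 50% overlap described in the example above.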
Step S102: and respectively carrying out fast Fourier transform operation on the sound signals of the plurality of divided sound frames to obtain corresponding frequency spectrum energy.
In a specific implementation, the plurality of divided sound signals are time-domain sound signals, and the time-domain sound signals can be converted into frequency-domain sound signals through Fast Fourier Transform (FFT).
Step S103: and converting the spectral energy corresponding to each sound frame into the spectral energy under the Mel frequency, and calculating the corresponding MFCC parameters.
In a specific implementation, the spectrum energy (power spectrum) of the sound signal is obtained through fast fourier transform operation, and may be converted into the spectrum energy at the Mel Frequency according to a preset corresponding relationship, and the Mel Frequency Cepstrum Coefficient (MFCC) parameter corresponding to each sound frame is calculated according to the spectrum energy at the Mel Frequency.
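The Hz-to-Mel conversion underlying this step can be sketched as below. The patent only refers to a "preset corresponding relationship" without giving a formula; the widely used 2595·log10(1 + f/700) mapping here is an assumption, not taken from the text.

```python
import math

# Common Hz-to-Mel mapping (an assumption; the patent does not specify
# which correspondence is used).
def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel, useful for placing Mel filterbank edges.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

In a full MFCC pipeline, the spectral energies would be accumulated through triangular filters spaced uniformly on this Mel scale, log-compressed, and passed through a discrete cosine transform to yield the cepstral coefficients.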
Step S104: and respectively calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be identified and a plurality of preset reference templates according to the MFCC parameters corresponding to each voice frame.
In a specific implementation, the preset multiple reference templates respectively include the voice contents of the corresponding keywords. The number of the preset reference templates may be set according to actual needs, and the present invention is not limited herein.
Step S105: and when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold value, taking the key words in the current reference template as recognition results.
In a specific implementation, the preset reference templates are traversed. For each template, the DTW distance median, the Euclidean distance median, and the cross-correlation distance median between the current voice data to be recognized and the current reference template are calculated, and the mean of these three medians is compared with the preset threshold. When the mean is determined to be smaller than the preset threshold, the keyword in the current reference template is taken as the recognition result; otherwise, it is determined that the current voice data to be recognized does not include the speech information of the keyword in the current reference template.
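The decision rule above can be expressed compactly. This is a sketch only: the distance values are made up for illustration, and the function name is hypothetical.

```python
from statistics import mean, median

# Decision rule: a template matches when the mean of the three distance
# medians falls below the preset threshold (values below are illustrative).
def matches_template(dtw_dists, euclid_dists, xcorr_dists, threshold):
    score = mean([median(dtw_dists), median(euclid_dists), median(xcorr_dists)])
    return score < threshold

hit = matches_template([0.2, 0.3, 0.25],   # per-frame DTW distances
                       [0.1, 0.4, 0.2],    # per-frame Euclidean distances
                       [0.3, 0.2, 0.5],    # per-frame cross-correlation distances
                       threshold=0.5)
```

Here the three medians are 0.25, 0.2, and 0.3, whose mean 0.25 is below the threshold, so the template's keyword would be taken as the recognition result.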
The keyword recognition method in the embodiment of the present invention will be described in further detail with reference to fig. 2.
Fig. 2 is a flowchart illustrating another keyword recognition method according to an embodiment of the present invention. The keyword recognition method as shown in fig. 2 may include the following steps:
step S201: and overlapping and framing the acquired sound data to obtain a plurality of corresponding sound frames.
In specific implementation, first, analog-to-digital conversion may be performed on the collected sound signal to obtain corresponding sound data. Next, the corresponding audio data may be overlapped and framed to obtain a plurality of audio frames. The collected sound data is subjected to framing, and the essence is that the sound data is subjected to short-time analysis. The short-time analysis is to divide the sound signal into short time segments with fixed periods, each short time segment being a relatively fixed sustained sound segment. The two adjacent sound frames are partially overlapped, and the overlapping range can be selected according to the actual situation.
Step S202: the obtained plurality of audio frames are subjected to windowing processing.
In a specific implementation, a window function commonly used in speech signal processing, such as a Hamming window, a Hanning window, or a rectangular window, may be selected; the frame length is chosen in the range of 10-40 ms, with 20 ms being a typical value. Framing a speech signal destroys its naturalness, and applying a window function to each speech frame, together with the overlap between frames, mitigates this problem.
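The Hamming window named above has a standard closed form, sketched here; the patent names the window types without giving coefficients, so the 0.54/0.46 coefficients are the textbook definition rather than the patent's own.

```python
import math

# Standard Hamming window of length n_len: w[n] = 0.54 - 0.46*cos(2*pi*n/(n_len-1)).
def hamming(n_len):
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_len - 1))
            for n in range(n_len)]

# Multiply a frame sample-by-sample with the window before the FFT.
def apply_window(frame):
    return [s * w for s, w in zip(frame, hamming(len(frame)))]
```

The window tapers each frame toward 0.08 at its edges, reducing the spectral leakage that abrupt frame boundaries would otherwise introduce into the FFT of the next step.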
Step S203: and carrying out fast Fourier transform operation on the sound frames subjected to windowing processing to obtain information of the frequency spectrum energy corresponding to each sound frame.
In a specific implementation, the sound data theoretically changes with time, is an unstable process, and cannot be directly converted into a frequency domain. However, since the sound data is subjected to framing processing (short-time analysis), the sound data of each frame can be considered to be relatively stable, and thus frequency domain conversion can be applied thereto.
In a specific implementation, Short-Time Fourier Transform (Short-Time Fourier Transform/Short-Term Fourier Transform, STFT) may be used to perform frequency domain conversion on the sound data of each frame, so as to obtain the spectrum information corresponding to each sound frame. Wherein, the obtained frequency spectrum comprises the relation between the frequency and the energy of the corresponding sound signal.
Step S204: and converting the spectral energy corresponding to each sound frame into the spectral energy under the Mel frequency, and calculating the corresponding MFCC parameters.
In an embodiment of the present invention, after obtaining the spectrum energies corresponding to the multiple voice frames of the current voice data to be recognized, it may first be determined whether the spectrum energy of the current voice data to be recognized is greater than a preset energy threshold, and when it is determined that the spectrum energy of the current voice data to be recognized is greater than the energy threshold, step S204 is continuously performed, otherwise, it is determined that the current voice data to be recognized does not include the speech information of the keyword, so that the subsequent processing on the current voice data to be recognized may be stopped, so as to further save the computing resources.
In a specific implementation, the spectral energy obtained through the FFT operation may be converted into spectral energy at Mel (Mel) frequency according to a preset corresponding relationship, and the MFCC parameter corresponding to each voice frame is calculated as the feature vector of each voice frame.
Step S205: and calculating to obtain a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the current sound frame and a current reference template in a plurality of preset reference templates according to the MFCC parameters corresponding to each sound frame.
In an embodiment of the present invention, when calculating the DTW distance between the current voice data to be recognized and a reference template, both the voice data to be recognized and the reference template are divided into I frames. The inventors of the present application have observed empirically that, during the recording of a reference template, the speaker's pronunciation tends to be more emphatic and the speech rate slower than usual. Therefore, the reference template is divided into I frames and the hop size used when calculating the DTW distance is 0.1·I frames; after the DTW distances between the I frames of the current voice data to be recognized and the I frames of the reference template are calculated, the median of the I DTW distances is taken as the DTW distance median between the current voice data to be recognized and the corresponding reference template. Similarly, the Euclidean distance (ED) median and the cross-correlation distance (CC) median between the current voice data to be recognized and the corresponding reference template can be obtained.
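The DTW distance used in this step can be sketched with the classic dynamic-programming recurrence. This is a generic textbook DTW, not the patent's exact implementation (which uses the hop size described above); the per-frame distance here is assumed to be plain Euclidean between MFCC vectors.

```python
# Minimal dynamic time warping (DTW) distance between two sequences of
# feature vectors, e.g. per-frame MFCC vectors.
def dtw_distance(seq_a, seq_b):
    def dist(x, y):
        # Euclidean distance between two feature vectors (an assumption).
        return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch seq_b
                                 cost[i][j - 1],      # stretch seq_a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]
```

Because DTW warps the time axis, a template spoken more slowly than the query (as the text notes happens during recording) can still align with near-zero cost, which a frame-by-frame Euclidean comparison alone could not achieve.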
Step S206: judging whether the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be identified and the current reference template is smaller than a preset threshold value or not; when the determination result is yes, step S207 may be executed, otherwise, the execution is started from step S205 for the next reference template in the preset plurality of reference templates.
In specific implementation, after a DTW distance median, a euclidean distance median and a cross-correlation distance median between current voice data to be recognized and a reference template are obtained through calculation, a mean value of the three is compared with a preset threshold.
In an embodiment of the present invention, the preset threshold is associated with the noise level of the current voice data to be recognized; that is, different noise levels correspond to different preset thresholds. When p ≥ p1, the voice data to be recognized is determined to have a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold; when p1 > p ≥ p2, the voice data to be recognized is determined to have a medium noise level, where p2 is a preset second threshold and p1 > p2; when p < p2, the voice data to be recognized is determined to have a high noise level. In one embodiment of the present invention, p1 equals 0.8 and p2 equals 0.45.
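The three-way noise-level classification can be written directly from the thresholds given in the text (p1 = 0.8, p2 = 0.45); the function name is illustrative.

```python
# Noise-level classification from the absolute amplitude p, using the
# thresholds stated in the embodiment: p1 = 0.8, p2 = 0.45.
def noise_level(p, p1=0.8, p2=0.45):
    if p >= p1:
        return "low"
    if p >= p2:        # i.e. p1 > p >= p2
        return "medium"
    return "high"      # p < p2
```

The recognition threshold of the decision step would then be looked up from the returned level, so that noisier recordings are judged against a more tolerant (or stricter) threshold as appropriate.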
Step S207: and taking the key words in the current reference template as recognition results and outputting the recognition results.
In a specific implementation, when it is determined that a mean value of a DTW distance median, a euclidean distance median, and a cross-correlation distance median between a certain reference template of the preset reference templates and the current voice data to be recognized is smaller than a preset threshold, it may be determined that the current voice data to be recognized includes voice information of a keyword in the reference template. Therefore, the keywords in the reference template can be used as the keyword recognition result of the current voice data to be recognized and output.
In a specific implementation, when the above keyword recognition method is applied to an alarm system, the alarm system may perform an alarm operation when a corresponding keyword is recognized.
It should be noted that, in emergency or other keyword applications, a naive (e.g., untrained) user may record personalized keywords. To ensure good recognition performance, the quality of the reference template becomes very important, and the recording quality of the reference template is therefore ensured by a simple checking operation.
Accordingly, the inventors of the present application propose three checks: detecting transient noise sources (such as a door slamming), detecting static noise sources (such as a fan or traffic noise), and checking that the keyword has rich pronunciation content. All three checks must be satisfied simultaneously; otherwise, the keyword needs to be recorded again. For transient noise detection, the difference of the absolute amplitude of the energy of the sound signal may be used over consecutive 25 ms sound frames with a hop of 5 ms, where the absolute amplitude may be averaged over every 5 sound frames. For static noise detection, the keyword is recorded within a preset 5 s time window in a quiet environment; within this window, the signal energy at the beginning and end of the reference template, which does not include the keyword, differs significantly from that of the sound data including the keyword. For the rich-pronunciation-content check, keywords consisting only of a single vowel without a consonant, such as "o", are rejected; this check can be made based on a modified zero-crossing rate associated with the pronunciation content of the keyword.
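The pronunciation-content check rests on the zero-crossing rate, which can be sketched as follows. The patent uses a "modified" zero-crossing rate whose exact definition it does not give, so this shows only the standard baseline quantity.

```python
# Plain zero-crossing rate of a frame: fraction of adjacent sample pairs
# whose signs differ. Consonant-rich speech has a higher rate than a
# sustained vowel such as "o".
def zero_crossing_rate(frame):
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)
```

A keyword consisting only of a low-frequency vowel would yield a low rate throughout the recording, which is the signal the check uses to reject it.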
The following describes a device corresponding to the keyword recognition method in the embodiment of the present invention in further detail.
Referring to fig. 3, the keyword recognition apparatus 300 according to the embodiment of the present invention may include a framing processing unit 301, a frequency domain converting unit 302, a first calculating unit 303, a second calculating unit 304, a determining unit 305, and a keyword recognition unit 306, wherein:
the framing processing unit 301 is adapted to divide the acquired sound data to be identified into a plurality of overlapped sound frames;
the frequency domain conversion unit 302 is adapted to traverse a plurality of sound frames obtained by dividing, and perform fast fourier transform operation on the sound signals of the traversed current sound frame to obtain corresponding spectral energy;
the first calculating unit 303 is adapted to convert the obtained spectral energy into spectral energy at a mel frequency, and calculate a corresponding MFCC parameter;
in a specific implementation, the keyword recognition apparatus 300 may further include a triggering unit (not shown in the figure), which is adapted to trigger the first calculating unit 303 to perform the operation of converting the obtained spectrum energy into the spectrum energy at the Mel frequency and calculating the corresponding MFCC parameter when the spectrum energy of the traversed current sound frame is greater than a preset energy threshold;
the second calculating unit 304 is adapted to calculate, according to the MFCC parameter corresponding to the current sound frame, a DTW distance median, an euclidean distance median, and a cross-correlation distance median between the current sound frame and a plurality of preset reference templates, respectively;
the determining unit 305 is adapted to determine whether an average of a DTW distance median, a euclidean distance median, and a cross-correlation distance median between the current sound frame and the reference template is smaller than a preset threshold;
in a specific implementation, the preset threshold is associated with the noise level of the current sound frame, wherein when p ≧ p1, the current sound frame is determined to have a low noise level, p represents the corresponding absolute amplitude of the current sound frame, and p1 is a preset first threshold; when p2 is more than or equal to p & gtp 1, determining that the current sound frame has a medium noise level, p2 is a preset second threshold value, and p1 & gtp 2; when p < p2, it is determined that the current sound frame has a high noise level. In one embodiment of the present invention, p1 equals 0.8, and p2 equals 0.45.
In a specific implementation, the reference template includes information of transient noise, stationary noise, and rich speech content of a particular person.
The keyword recognition unit 306 is adapted to, when it is determined that the mean of the DTW distance median, the euclidean distance median, and the cross-correlation distance median between the current sound frame and the reference template is smaller than a preset threshold, take the keyword in the current reference template as a recognition result and output the recognition result.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the relevant hardware, and the program may be stored in a computer-readable storage medium, which may include: a ROM, a RAM, a magnetic disk, an optical disk, and the like.
The method and system of the embodiments of the present invention have been described in detail, but the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (12)

1. A keyword recognition method, comprising:
dividing the acquired voice data to be identified into a plurality of overlapped voice frames;
respectively carrying out fast Fourier transform operation on the sound signals of the plurality of divided sound frames to obtain corresponding frequency spectrum energy;
converting the spectrum energy corresponding to each sound frame into the spectrum energy under the Mel frequency, and calculating the corresponding MFCC parameters;
respectively calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be identified and a plurality of preset reference templates according to MFCC parameters corresponding to each voice frame; the plurality of reference templates respectively comprise voice contents of corresponding keywords; when calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be identified and the preset multiple reference templates, dividing the voice data to be identified and the reference templates into I frames; the method comprises the steps that each hop for calculating DTW distance, Euclidean distance and cross-correlation distance is 0.1I frame, after the DTW distance, the Euclidean distance and the cross-correlation distance between the I frame of current voice data to be recognized and the I frame of a reference template are obtained through calculation, the median value of I DTW distances is used as the DTW median value of the voice data to be recognized and the corresponding reference template, the median value of I Euclidean distances is used as the median value of the Euclidean distances between the voice data to be recognized and the corresponding reference template, and the median value of I cross-correlation distances is used as the median value of the cross-correlation distance between the voice data to be recognized and the corresponding reference template;
and when determining that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold value, taking the keywords in the current reference template as the keyword recognition result of the voice data to be recognized.
2. The keyword recognition method according to claim 1, wherein the operation of converting the spectral energy corresponding to each sound frame into spectral energy on the Mel-frequency scale and calculating the corresponding MFCC parameters is performed when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
3. The keyword recognition method according to claim 1, wherein the preset threshold is associated with the noise level of the voice data to be recognized.
4. The keyword recognition method according to claim 3, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
5. The keyword recognition method according to claim 4, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
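Claims 3 through 5 tie the preset threshold to a noise level derived from the absolute amplitude p of the input. A minimal sketch of that three-way classification, assuming only the thresholds the claims themselves state (p1 = 0.8, p2 = 0.45); how each level then maps to a concrete decision threshold is not specified in the claims and is omitted here.

```python
def noise_level(p, p1=0.8, p2=0.45):
    # Classify the noise level of the voice data to be recognized from
    # its absolute amplitude p, per claims 4 and 5.
    if p >= p1:
        return "low"      # p >= p1: low noise level
    if p >= p2:           # p1 > p >= p2: medium noise level
        return "medium"
    return "high"         # p < p2: high noise level
```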
6. The keyword recognition method according to claim 1, wherein the reference template includes information on transient noise, static noise and the rich voice content of a specific person.
7. A keyword recognition apparatus, comprising:
the framing processing unit is suitable for dividing the acquired voice data to be recognized into a plurality of overlapping sound frames;
the frequency domain conversion unit is suitable for respectively carrying out fast Fourier transform operation on the sound signals of the plurality of divided sound frames to obtain corresponding frequency spectrum energy;
the first calculation unit is suitable for converting the spectral energy corresponding to each sound frame into spectral energy on the Mel-frequency scale and calculating the corresponding MFCC parameters;
the second calculation unit is suitable for respectively calculating a DTW distance median, a Euclidean distance median and a cross-correlation distance median between the voice data to be recognized and a plurality of preset reference templates according to the MFCC parameters corresponding to each voice frame, wherein each of the plurality of reference templates comprises the voice content of a corresponding keyword; when calculating the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the plurality of preset reference templates, the voice data to be recognized and the reference templates are divided into I frames, the hop for calculating the DTW distance, the Euclidean distance and the cross-correlation distance being 0.1I frames; after the DTW distances, the Euclidean distances and the cross-correlation distances between the I frames of the current voice data to be recognized and the I frames of a reference template are obtained by calculation, the median of the I DTW distances is taken as the DTW distance median between the voice data to be recognized and the corresponding reference template, the median of the I Euclidean distances is taken as the Euclidean distance median between the voice data to be recognized and the corresponding reference template, and the median of the I cross-correlation distances is taken as the cross-correlation distance median between the voice data to be recognized and the corresponding reference template;
the judging unit is suitable for judging whether the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than a preset threshold;
and the keyword recognition unit is suitable for taking the keyword in the current reference template as the keyword recognition result of the voice data to be recognized when it is determined that the mean value of the DTW distance median, the Euclidean distance median and the cross-correlation distance median between the voice data to be recognized and the current reference template is smaller than the preset threshold.
8. The keyword recognition apparatus according to claim 7, further comprising a triggering unit, wherein the triggering unit is adapted to trigger the first calculation unit to perform the operation of converting the spectral energy corresponding to each sound frame into spectral energy on the Mel-frequency scale and calculating the corresponding MFCC parameters when the spectral energy of the voice data to be recognized is greater than a preset energy threshold.
9. The keyword recognition apparatus according to claim 7, wherein the preset threshold is associated with a noise level of the voice data to be recognized.
10. The keyword recognition apparatus according to claim 9, wherein the noise level of the voice data to be recognized includes a low noise level, a medium noise level, and a high noise level, wherein:
when p ≥ p1, determining that the voice data to be recognized has a low noise level, where p represents the absolute amplitude corresponding to the voice data to be recognized and p1 is a preset first threshold;
when p1 > p ≥ p2, determining that the voice data to be recognized has a medium noise level, where p2 is a preset second threshold and p1 > p2;
when p < p2, determining that the voice data to be recognized has a high noise level.
11. The keyword recognition apparatus of claim 10, wherein p1 is equal to 0.8 and p2 is equal to 0.45.
12. The keyword recognition apparatus according to claim 7, wherein the reference template includes information on transient noise, static noise and the rich voice content of a specific person.
CN201510993729.8A 2015-12-25 2015-12-25 Keyword recognition method and device Active CN106920558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510993729.8A CN106920558B (en) 2015-12-25 2015-12-25 Keyword recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510993729.8A CN106920558B (en) 2015-12-25 2015-12-25 Keyword recognition method and device

Publications (2)

Publication Number Publication Date
CN106920558A CN106920558A (en) 2017-07-04
CN106920558B true CN106920558B (en) 2021-04-13

Family

ID=59454658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510993729.8A Active CN106920558B (en) 2015-12-25 2015-12-25 Keyword recognition method and device

Country Status (1)

Country Link
CN (1) CN106920558B (en)

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2005114576A1 (en) * 2004-05-21 2005-12-01 Asahi Kasei Kabushiki Kaisha Operation content judgment device
CN101222703A * 2007-01-12 2008-07-16 Hangzhou Bird Software Co., Ltd. Identity verification method for mobile terminal based on voice identification
CN101599269B * 2009-07-02 2011-07-20 China Agricultural University Speech endpoint detection method and device
US8432368B2 * 2010-01-06 2013-04-30 Qualcomm Incorporated User interface methods and systems for providing force-sensitive input
CN102509547B * 2011-12-29 2013-06-19 Liaoning University of Technology Method and system for voiceprint recognition based on vector quantization
CN103021409B * 2012-11-13 2016-02-24 Anhui USTC iFlytek Co., Ltd. Voice-activated camera system
CN103065627B * 2012-12-17 2015-07-29 Central South University Special-purpose vehicle siren sound recognition method based on DTW and HMM evidence fusion
CN103971678B * 2013-01-29 2015-08-12 Tencent Technology (Shenzhen) Co., Ltd. Keyword spotting method and apparatus
CN103854645B * 2014-03-05 2016-08-24 Southeast University Speaker-independent speech emotion recognition method based on speaker penalization
CN104978507B * 2014-04-14 2019-02-01 China Petrochemical Corporation Identity authentication method based on voiceprint recognition for an intelligent well-logging evaluation expert system
CN104103280B * 2014-07-15 2017-06-06 Wuxi Zhonggan Microelectronics Co., Ltd. Method and apparatus for offline speech endpoint detection based on a dynamic time warping algorithm
CN104103272B * 2014-07-15 2017-10-10 Wuxi Zhonggan Microelectronics Co., Ltd. Speech recognition method, apparatus and Bluetooth headset
CN104778951A * 2015-04-07 2015-07-15 Huawei Technologies Co., Ltd. Speech enhancement method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"Voice Command Recognition system based on MFCC and DTW";Abhijeet Kumar;《International Journal or engineering Science and Technology》;20101231;全文 *
"Voice Recognition Algorithms using Mel Frequency Cepstral Coefficient and DTW techniques";Lindasalwa;《Journal of Computing》;20100331;第2卷(第3期);全文 *
"一种结合端点检测可检错的DTW乐谱跟随算法";吴康妍;《计算机应用与软件》;20150315;全文 *
"加权DTW距离的自动步态识别";刘志镜;《中国图像图形学报》;20101231;全文 *
"时间序列动态模糊聚类的研究";赵晓慧;《中国优秀硕士学位论文全文数据库 信息科技辑》;20141231;全文 *

Also Published As

Publication number Publication date
CN106920558A (en) 2017-07-04

Similar Documents

Publication Publication Date Title
US9875739B2 (en) Speaker separation in diarization
US8140330B2 (en) System and method for detecting repeated patterns in dialog systems
Zhou et al. Efficient audio stream segmentation via the combined T/sup 2/statistic and Bayesian information criterion
WO2017084360A1 (en) Method and system for speech recognition
WO2014153800A1 (en) Voice recognition system
CN100485780C (en) Quick audio-frequency separating method based on tonic frequency
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
EP2083417B1 (en) Sound processing device and program
KR101942521B1 (en) Speech endpointing
Lokhande et al. Voice activity detection algorithm for speech recognition applications
CN108198547B (en) Voice endpoint detection method and device, computer equipment and storage medium
CN105529028A (en) Voice analytical method and apparatus
CN108091340B (en) Voiceprint recognition method, voiceprint recognition system, and computer-readable storage medium
Priyadarshani et al. Dynamic Time Warping based speech recognition for isolated Sinhala words
Jung et al. Linear-scale filterbank for deep neural network-based voice activity detection
JP5282523B2 (en) Basic frequency extraction method, basic frequency extraction device, and program
Tong et al. Evaluating VAD for automatic speech recognition
US20190348032A1 (en) Methods and apparatus for asr with embedded noise reduction
CN106920558B (en) Keyword recognition method and device
Zehetner et al. Wake-up-word spotting for mobile systems
Chaudhary et al. Gender identification based on voice signal characteristics
JP6526602B2 (en) Speech recognition apparatus, method thereof and program
KR101122590B1 (en) Apparatus and method for speech recognition by dividing speech data
Guo et al. Research on voice activity detection in burst and partial duration noisy environment
Fukuda et al. Breath-detection-based telephony speech phrasing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant