KR20140059601A - Voice recognition performance improvement using intra frame feature - Google Patents
Info
- Publication number
- KR20140059601A (application KR1020120126211A)
- Authority
- KR
- South Korea
- Prior art keywords
- frame
- speech signals
- speech signal
- signal
- speech
- Prior art date
Links
- 238000000034 method Methods 0.000 claims abstract description 24
- 238000007781 pre-processing Methods 0.000 claims abstract description 8
- 230000037433 frameshift Effects 0.000 claims abstract description 5
- 230000000903 blocking effect Effects 0.000 claims description 3
- 238000000605 extraction Methods 0.000 description 5
- 239000013598 vector Substances 0.000 description 3
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000000737 periodic effect Effects 0.000 description 2
- 238000010183 spectrum analysis Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 230000001133 acceleration Effects 0.000 description 1
- 238000007792 addition Methods 0.000 description 1
- 230000002238 attenuated effect Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000012805 post-processing Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/038—Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
BACKGROUND OF THE INVENTION 1. Field of the Invention [0001] The present invention relates to a method for improving automatic speech recognition performance using intra-frame characteristics, and more particularly, to a technique for improving automatic speech recognition performance using intra-frame characteristics together with a gammatone filter bank.
The conventional digital speech signal framing for automatic speech recognition uses a frame shift of 10 ms and a frame size of 20 to 25 ms.
This is based on the quasi-stationary assumption that the periodic character of speech is statistically stationary within a 20-25 ms window.
While this assumption holds for voiced sounds such as vowels, it is known to break down in speech segments whose characteristics change rapidly over time, such as consonants.
The conventional speech feature extraction method for automatic speech recognition assumes that one signal frame is stationary and extracts static feature vectors such as MFCC, PLP, and GTPCC; velocity and acceleration information, called the delta and delta-delta characteristics respectively, is then derived from them.
At present, feature extraction for automatic speech recognition combines static and dynamic features to construct a single feature vector.
The dynamic characteristics comprise the delta and delta-delta coefficients, which capture the temporal change of the cepstral coefficients.
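The delta and delta-delta dynamics described above are commonly computed by linear regression over neighbouring frames. The following is an illustrative sketch, not the patent's own implementation; the regression window of two frames per side is an assumed, typical value.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta (first-order dynamic) coefficients.

    features: (num_frames, num_coeffs) static features, e.g. cepstra.
    window:   number of neighbouring frames used on each side.
    """
    num_frames = len(features)
    # Repeat edge frames so every frame has a full regression context.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        acc = np.zeros(features.shape[1])
        for n in range(1, window + 1):
            acc += n * (padded[t + window + n] - padded[t + window - n])
        out[t] = acc / denom
    return out
```

Delta-delta (acceleration) coefficients are then simply `delta(delta(static_features))`, applying the same regression twice.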
Traditionally, the signal frame size for automatic speech recognition is 20 to 25 ms, a choice based on the periodic nature of vowels.
Consequently, although the dynamic characteristics of consonants can change to a meaningful degree even within a single frame, current speech feature extraction methods fail to take this into account.
Korean Patent Laid-Open No. 1997-0028836 discloses a speech recognition apparatus using delta-delta cepstrum coefficients and a control method thereof. However, the disclosed technology is limited in that it does not provide a speech feature extraction technique that takes intra-frame characteristics into consideration.
It is an object of the present invention to provide a method for improving the performance of an automatic speech recognition system by combining intraframe characteristics with existing speech characteristics.
According to another aspect of the present invention, there is provided a method for enhancing automatic speech recognition using intra-frame characteristics, comprising the steps of: collecting a speech signal and pre-processing it by emphasizing or attenuating the signal; dividing the pre-processed speech signal into critical bands through a gammatone filter bank and channelizing each critical-band signal; frame-blocking the channelized speech signal with a frame shift size of 10 ms and a frame size of 20 to 25 ms; performing a Hamming windowing process on each of the blocked channels to extract a predetermined amount of data within a predetermined interval; estimating the signal strength of the extracted data on a time-frequency basis and estimating the energy through the estimated signal strength; deriving cepstral coefficients and their derivatives through a logarithmic operation and a discrete cosine transform (DCT) on the estimated energy; performing a sub-frame analysis on the pre-processed speech signal and extracting intra-frame characteristics from the sub-frame-analyzed signal; and deriving speech recognition features by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.
According to an embodiment of the present invention, when constructing a feature vector for speech recognition, speech recognition performance can be improved by adding intra-frame characteristics that indicate how the statistical properties of the speech signal vary within a signal frame.
FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.
The present invention will now be described in detail with reference to the accompanying drawings. Hereinafter, repeated descriptions, and detailed descriptions of known functions and configurations that may obscure the gist of the present invention, will be omitted. The embodiments of the present invention are provided to describe the present invention more fully to those skilled in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.
Referring to FIG. 1, a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention first collects a human speech signal and pre-processes it (S10).
To prepare a human speech signal for analysis, the collected signal can be emphasized or attenuated beyond a predetermined level.
Thereafter, the speech signal pre-processed in step S10 is divided into critical bands through the gammatone filter bank, and each critical-band signal is channelized (S20).
At this time, the speech signal channelized through the gammatone filter bank may be provided to the sub-frame analysis step.
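As an illustration of the gammatone channelization in step S20, the sketch below builds a small gammatone filter bank from first principles. The Glasberg-Moore ERB bandwidth formula and the 4th-order impulse response are standard textbook choices, assumed here for illustration rather than taken from the patent.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of one gammatone channel centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)    # Glasberg-Moore ERB width
    b = 1.019 * erb                            # bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))               # peak-normalise

def gammatone_filterbank(signal, fs, centre_freqs):
    """Convolve the signal with each channel IR: one critical band per row."""
    return np.stack([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])
```

Each row of the returned array is one channelized critical-band signal, ready for the frame blocking of step S30.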
In step S30, each channel separated in step S20 is subjected to a frame blocking process with a frame shift size of 10 ms and a frame size of 20 to 25 ms.
At this time, if a speech feature extraction method based on spectral analysis, such as Mel-frequency cepstral coefficients (MFCC), is applied, an additional spectral analysis process is required along with the sub-frame analysis process of step S30.
Thereafter, Hamming windowing is performed on each of the blocked channels to extract a predetermined amount of data within a predetermined interval (S40).
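Steps S30 and S40 together block a channel signal into overlapping frames and apply a Hamming window. A minimal sketch, assuming a 16 kHz sampling rate and the 10 ms shift / 25 ms frame sizes stated above:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, shift_ms=10.0):
    """Block a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # e.g. 160 samples at 16 kHz
    num_frames = 1 + (len(x) - frame_len) // shift
    window = np.hamming(frame_len)
    return np.stack([x[i * shift : i * shift + frame_len] * window
                     for i in range(num_frames)])
```

One second of 16 kHz audio yields 98 frames of 400 samples each; the windowing tapers frame edges so later spectral analysis is not distorted by blocking discontinuities.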
Thereafter, the signal intensity is estimated on a time-frequency basis from the data extracted in step S40, and the energy is estimated through the estimated signal intensity (S50).
Then, the estimated energy is subjected to a logarithmic operation and a discrete cosine transform (DCT) to derive cepstral coefficients and their derivatives (S60).
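The log-and-DCT post-processing of step S60 can be sketched as follows. The cepstral dimension of 13 and the unnormalized DCT-II basis are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def cepstral_coeffs(band_energies, num_ceps=13):
    """Cepstral coefficients from per-channel energies: log, then DCT-II.

    band_energies: (num_frames, num_channels) non-negative energies.
    """
    log_e = np.log(band_energies + 1e-10)    # small floor avoids log(0)
    n = log_e.shape[1]
    k = np.arange(num_ceps)[:, None]
    m = np.arange(n)[None, :]
    # DCT-II basis; matches scipy.fft.dct(type=2) up to a constant factor.
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return log_e @ basis.T
```

A flat log-energy spectrum produces energy only in the zeroth coefficient, which is why the DCT compactly decorrelates the smooth spectral envelope.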
Sub-frame analysis is performed on the speech signal pre-processed in step S10 (S70), and intra-frame characteristics are extracted from the sub-frame-analyzed speech signal (S80).
At this time, the intra-frame characteristics include the sub-frame log-energy difference within a frame, the sub-frame cepstral coefficient difference within a frame, the sub-frame resonance ratio difference within a frame, the sub-frame zero-crossing rate (ZCR) difference within a frame, the sub-frame tonality difference within a frame, and the sub-frame entropy difference within a frame.
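Two of the listed intra-frame descriptors, the sub-frame log-energy difference and the sub-frame ZCR difference, can be sketched as below. The choice of four sub-frames per frame is an assumption made for illustration; the patent does not fix the sub-frame count here.

```python
import numpy as np

def intra_frame_features(frame, num_sub=4):
    """Differences of sub-frame statistics within one signal frame.

    Splits the frame into num_sub equal sub-frames and returns the
    first differences of per-sub-frame log energy and zero-crossing
    rate, capturing how the signal statistics vary inside the frame.
    """
    subs = np.array_split(frame, num_sub)
    log_e = np.array([np.log(np.sum(s ** 2) + 1e-10) for s in subs])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(s))) > 0) for s in subs])
    return np.diff(log_e), np.diff(zcr)
```

For a frame whose first half is loud and second half quiet, the log-energy differences are near zero except at the loud-to-quiet boundary, which is exactly the kind of within-frame change that static per-frame features miss.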
Thereafter, the speech recognition features can be derived by combining the cepstral coefficients and derivatives obtained in step S60 with the intra-frame characteristics extracted in step S80 (S90).
While the present invention has been described in detail with reference to preferred embodiments thereof, those skilled in the art will appreciate that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. For example, the invention may be embodied as a recording medium on which a program implementing the method of the present invention is recorded. The above-described embodiments are therefore to be understood as illustrative in all aspects and not restrictive. The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the scope of the claims and their equivalents should be construed as falling within the scope of the present invention.
S10: preprocessing step S20: gamma tone filtering step
S30: Frame blocking step S40: Hamming windowing step
S50: Power and energy estimation step
S60: Post-processing step
S70: Sub-frame analysis step
S80: Intra frame feature extraction step
S90: Voice recognition characteristic combination step
Claims (1)
A method for improving automatic speech recognition using intra-frame characteristics, the method comprising: collecting a speech signal and pre-processing the collected speech signal; dividing the pre-processed speech signal into critical bands through a gammatone filter bank and channelizing each critical-band signal;
Performing a frame blocking process on the channelized speech signal with a frame shift size of 10 ms and a frame size of 20 to 25 ms;
Performing a hamming windowing process on each of the blocked channels to extract a predetermined amount of data within a predetermined interval;
Estimating a signal strength of the extracted data on a time-frequency basis, and estimating energy through the estimated signal strength;
Deriving cepstral coefficients and derivatives through a logarithmic operation and a discrete cosine transform (DCT) on the estimated energy;
Performing a sub-frame analysis on the speech signal subjected to the preprocessing process and extracting intraframe characteristics from the speech signal analyzed in the sub-frame;
and deriving a speech recognition feature by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140059601A (en) | 2014-05-16 |
Family
ID=50889380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140059601A (en) |
- 2012-11-08: KR application KR1020120126211A filed, published as KR20140059601A (status: not active, Application Discontinuation)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9818433B2 (en) | Voice activity detector for audio signals | |
US11064296B2 (en) | Voice denoising method and apparatus, server and storage medium | |
EP3602549B1 (en) | Apparatus and method for post-processing an audio signal using a transient location detection | |
KR101330237B1 (en) | Method and apparatus for adjusting channel delay parameters of multi-channel signal | |
US8326610B2 (en) | Producing phonitos based on feature vectors | |
JP6793706B2 (en) | Methods and devices for detecting audio signals | |
US9396739B2 (en) | Method and apparatus for detecting voice signal | |
DE102014100407A1 (en) | Noise reduction devices and noise reduction methods | |
EP2381438A1 (en) | Signal classification processing method, classification processing device and encoding system | |
US11335355B2 (en) | Estimating noise of an audio signal in the log2-domain | |
CN110706693A (en) | Method and device for determining voice endpoint, storage medium and electronic device | |
KR20110043695A (en) | Method and apparatus to facilitate determining signal bounding frequencies | |
KR20150032390A (en) | Speech signal process apparatus and method for enhancing speech intelligibility | |
KR102196390B1 (en) | Method and apparatus for extracting phase difference parameters between channels | |
CN115348507A (en) | Impulse noise suppression method, system, readable storage medium and computer equipment | |
US8935159B2 (en) | Noise removing system in voice communication, apparatus and method thereof | |
KR100571427B1 (en) | Feature Vector Extraction Unit and Inverse Correlation Filtering Method for Speech Recognition in Noisy Environments | |
KR20140059601A (en) | Voice recognition performance improvement using intra frame feature | |
KR101096091B1 (en) | Apparatus for Separating Voice and Method for Separating Voice of Single Channel Using the Same | |
Darabian et al. | Improving the performance of MFCC for Persian robust speech recognition | |
WO2009055718A1 (en) | Producing phonitos based on feature vectors | |
US10129659B2 (en) | Dialog enhancement complemented with frequency transposition | |
KR20080049385A (en) | Pre-processing method and device for clean speech feature estimation based on masking probability | |
KR100744375B1 (en) | Apparatus and method for processing sound signal | |
Gbadamosi et al. | Non-Intrusive Noise Reduction in GSM Voice Signal Using Non-Parametric Modeling Technique. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |