KR20140059601A - Voice recognition performance improvement using intra frame feature - Google Patents
Info
- Publication number
- KR20140059601A (application KR1020120126211A)
- Authority
- KR
- South Korea
- Prior art keywords
- frame
- speech signals
- speech signal
- signal
- speech
- Prior art date
Links
- 238000000034 method Methods 0.000 claims abstract description 24
- 238000007781 pre-processing Methods 0.000 claims abstract description 8
- 230000037433 frameshift Effects 0.000 claims abstract description 5
- 230000000903 blocking effect Effects 0.000 claims description 3
- 238000000605 extraction Methods 0.000 description 5
- 239000013598 vector Substances 0.000 description 3
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 230000000737 periodic effect Effects 0.000 description 2
- 238000010183 spectrum analysis Methods 0.000 description 2
- 230000003068 static effect Effects 0.000 description 2
- 230000001133 acceleration Effects 0.000 description 1
- 238000007792 addition Methods 0.000 description 1
- 230000002238 attenuated effect Effects 0.000 description 1
- 230000002708 enhancing effect Effects 0.000 description 1
- 239000000284 extract Substances 0.000 description 1
- 238000001914 filtration Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000012805 post-processing Methods 0.000 description 1
- 238000012545 processing Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/038—Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Abstract
Description
BACKGROUND OF THE INVENTION 1. Field of the Invention [0001] The present invention relates to a method for improving automatic speech recognition performance using intra-frame characteristics, and more particularly, to a technique for improving automatic speech recognition performance using intra-frame characteristics together with a gammatone filter bank.
The conventional digital speech signal framing for automatic speech recognition uses a frame shift of 10 ms and a frame size of 20 to 25 ms.
This is based on the quasi-stationary assumption that the periodic character of speech is statistically stationary within a 20-25 ms window.
While this assumption holds for voiced sounds such as vowels, it is known to break down in speech segments whose characteristics change rapidly over time, such as consonants.
The conventional speech feature extraction method for automatic speech recognition assumes that one signal frame is stationary and extracts static feature vectors such as MFCC, PLP, and GTPCC; velocity and acceleration information, called the delta and delta-delta characteristics respectively, is then derived from them.
At present, feature extraction for automatic speech recognition combines static and dynamic features to construct a single feature vector.
The dynamic characteristics comprise the delta and delta-delta coefficients, which capture the temporal change of the cepstral coefficients.
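The delta and delta-delta dynamics described above are commonly computed by linear regression over neighbouring frames. The following is an illustrative sketch, not the patent's own implementation; the regression window of two frames per side is an assumed, typical value.

```python
import numpy as np

def delta(features, window=2):
    """Regression-based delta (first-order dynamic) coefficients.

    features: (num_frames, num_coeffs) static features, e.g. cepstra.
    window:   number of neighbouring frames used on each side.
    """
    num_frames = len(features)
    # Repeat edge frames so every frame has a full regression context.
    padded = np.pad(features, ((window, window), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = np.empty_like(features, dtype=float)
    for t in range(num_frames):
        acc = np.zeros(features.shape[1])
        for n in range(1, window + 1):
            acc += n * (padded[t + window + n] - padded[t + window - n])
        out[t] = acc / denom
    return out
```

Delta-delta (acceleration) coefficients are then simply `delta(delta(static_features))`, applying the same regression twice.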
Traditionally, the signal frame size for automatic speech recognition is 20 to 25 ms, a choice based on the periodic nature of vowels.
Consequently, although the dynamic characteristics of consonants can change to a meaningful degree even within a single frame, current speech feature extraction methods fail to take this into account.
Korean Patent Laid-Open No. 1997-0028836 discloses a speech recognition apparatus using delta-delta cepstrum coefficients and a control method thereof. However, the disclosed technology is limited in that it does not provide a speech feature extraction technique that takes intra-frame characteristics into consideration.
It is an object of the present invention to provide a method for improving the performance of an automatic speech recognition system by combining intraframe characteristics with existing speech characteristics.
According to another aspect of the present invention, there is provided a method for enhancing automatic speech recognition using intra-frame characteristics, comprising the steps of: collecting a speech signal and pre-processing it by emphasizing or attenuating the signal; dividing the pre-processed speech signal into critical bands through a gammatone filter bank and channelizing each critical-band signal; frame-blocking the channelized speech signal with a frame shift size of 10 ms and a frame size of 20 to 25 ms; performing a Hamming windowing process on each of the blocked channels to extract a predetermined amount of data within a predetermined interval; estimating the signal strength of the extracted data on a time-frequency basis and estimating the energy through the estimated signal strength; deriving cepstral coefficients and their derivatives through a logarithmic operation and a discrete cosine transform (DCT) on the estimated energy; performing a sub-frame analysis on the pre-processed speech signal and extracting intra-frame characteristics from the sub-frame-analyzed signal; and deriving speech recognition features by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.
According to an embodiment of the present invention, when constructing a feature vector for speech recognition, speech recognition performance can be improved by adding intra-frame characteristics that indicate how the statistical properties of the speech signal vary within a signal frame.
FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.
The present invention will now be described in detail with reference to the accompanying drawings. Hereinafter, repeated descriptions, and detailed descriptions of known functions and configurations that may obscure the gist of the present invention, will be omitted. The embodiments of the present invention are provided to describe the present invention more fully to those skilled in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.
Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.
FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.
Referring to FIG. 1, a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention first collects a human speech signal and pre-processes it (S10).
To prepare a human speech signal for analysis, the collected signal can be emphasized or attenuated beyond a predetermined level.
Thereafter, the speech signal pre-processed in step S10 is divided into critical bands through the gammatone filter bank, and each critical-band signal is channelized (S20).
At this time, the speech signal channelized through the gammatone filter bank may be provided to the sub-frame analysis step.
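As an illustration of the gammatone channelization in step S20, the sketch below builds a small gammatone filter bank from first principles. The Glasberg-Moore ERB bandwidth formula and the 4th-order impulse response are standard textbook choices, assumed here for illustration rather than taken from the patent.

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of one gammatone channel centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)    # Glasberg-Moore ERB width
    b = 1.019 * erb                            # bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))               # peak-normalise

def gammatone_filterbank(signal, fs, centre_freqs):
    """Convolve the signal with each channel IR: one critical band per row."""
    return np.stack([np.convolve(signal, gammatone_ir(fc, fs), mode="same")
                     for fc in centre_freqs])
```

Each row of the returned array is one channelized critical-band signal, ready for the frame blocking of step S30.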
In step S30, each channel separated in step S20 is subjected to a frame blocking process with a frame shift size of 10 ms and a frame size of 20 to 25 ms.
At this time, if a speech feature extraction method based on spectral analysis, such as Mel-frequency cepstral coefficients (MFCC), is applied, an additional spectral analysis process is required along with the sub-frame analysis process of step S30.
Thereafter, Hamming windowing is performed on each of the blocked channels to extract a predetermined amount of data within a predetermined interval (S40).
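Steps S30 and S40 together block a channel signal into overlapping frames and apply a Hamming window. A minimal sketch, assuming a 16 kHz sampling rate and the 10 ms shift / 25 ms frame sizes stated above:

```python
import numpy as np

def frame_signal(x, fs, frame_ms=25.0, shift_ms=10.0):
    """Block a 1-D signal into overlapping Hamming-windowed frames."""
    frame_len = int(fs * frame_ms / 1000)   # e.g. 400 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)       # e.g. 160 samples at 16 kHz
    num_frames = 1 + (len(x) - frame_len) // shift
    window = np.hamming(frame_len)
    return np.stack([x[i * shift : i * shift + frame_len] * window
                     for i in range(num_frames)])
```

One second of 16 kHz audio yields 98 frames of 400 samples each; the windowing tapers frame edges so later spectral analysis is not distorted by blocking discontinuities.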
Thereafter, the signal intensity is estimated on a time-frequency basis from the data extracted in step S40, and the energy is estimated through the estimated signal intensity (S50).
Then, the estimated energy is subjected to a logarithmic operation and a discrete cosine transform (DCT) to derive cepstral coefficients and their derivatives (S60).
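The log-and-DCT post-processing of step S60 can be sketched as follows. The cepstral dimension of 13 and the unnormalized DCT-II basis are illustrative assumptions, not values fixed by the patent.

```python
import numpy as np

def cepstral_coeffs(band_energies, num_ceps=13):
    """Cepstral coefficients from per-channel energies: log, then DCT-II.

    band_energies: (num_frames, num_channels) non-negative energies.
    """
    log_e = np.log(band_energies + 1e-10)    # small floor avoids log(0)
    n = log_e.shape[1]
    k = np.arange(num_ceps)[:, None]
    m = np.arange(n)[None, :]
    # DCT-II basis; matches scipy.fft.dct(type=2) up to a constant factor.
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return log_e @ basis.T
```

A flat log-energy spectrum produces energy only in the zeroth coefficient, which is why the DCT compactly decorrelates the smooth spectral envelope.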
Sub-frame analysis is performed on the speech signal pre-processed in step S10 (S70), and intra-frame characteristics are extracted from the sub-frame-analyzed speech signal (S80).
At this time, the intra-frame characteristics include the sub-frame log-energy difference within a frame, the sub-frame cepstral coefficient difference within a frame, the sub-frame resonance ratio difference within a frame, the sub-frame zero-crossing rate (ZCR) difference within a frame, the sub-frame tonality difference within a frame, and the sub-frame entropy difference within a frame.
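Two of the listed intra-frame descriptors, the sub-frame log-energy difference and the sub-frame ZCR difference, can be sketched as below. The choice of four sub-frames per frame is an assumption made for illustration; the patent does not fix the sub-frame count here.

```python
import numpy as np

def intra_frame_features(frame, num_sub=4):
    """Differences of sub-frame statistics within one signal frame.

    Splits the frame into num_sub equal sub-frames and returns the
    first differences of per-sub-frame log energy and zero-crossing
    rate, capturing how the signal statistics vary inside the frame.
    """
    subs = np.array_split(frame, num_sub)
    log_e = np.array([np.log(np.sum(s ** 2) + 1e-10) for s in subs])
    zcr = np.array([np.mean(np.abs(np.diff(np.sign(s))) > 0) for s in subs])
    return np.diff(log_e), np.diff(zcr)
```

For a frame whose first half is loud and second half quiet, the log-energy differences are near zero except at the loud-to-quiet boundary, which is exactly the kind of within-frame change that static per-frame features miss.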
Thereafter, the speech recognition features can be derived by combining the cepstral coefficients and derivatives obtained in step S60 with the intra-frame characteristics extracted in step S80 (S90).
While the present invention has been described in detail with reference to preferred embodiments thereof, those skilled in the art will appreciate that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. For example, the invention may be embodied as a recording medium on which a program implementing the method of the present invention is recorded. The above-described embodiments are therefore to be understood as illustrative in all aspects and not restrictive. The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the scope of the claims and their equivalents should be construed as falling within the scope of the present invention.
S10: preprocessing step S20: gamma tone filtering step
S30: Frame blocking step S40: Hamming windowing step
S50: Power and energy estimation step
S60: Post-processing step
S70: Sub-frame analysis step
S80: Intra frame feature extraction step
S90: Voice recognition characteristic combination step
Claims (1)
A method for improving automatic speech recognition using intra-frame characteristics, the method comprising: collecting a speech signal and pre-processing the collected speech signal; dividing the pre-processed speech signal into critical bands through a gammatone filter bank and channelizing each critical-band signal;
Performing a frame blocking process on the channelized speech signal with a frame shift size of 10 ms and a frame size of 20 to 25 ms;
Performing a hamming windowing process on each of the blocked channels to extract a predetermined amount of data within a predetermined interval;
Estimating a signal strength of the extracted data on a time-frequency basis, and estimating energy through the estimated signal strength;
Deriving cepstral coefficients and derivatives through a logarithmic operation and a discrete cosine transform (DCT) on the estimated energy;
Performing a sub-frame analysis on the speech signal subjected to the preprocessing process and extracting intraframe characteristics from the speech signal analyzed in the sub-frame;
and deriving a speech recognition feature by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Publications (1)
Publication Number | Publication Date |
---|---|
KR20140059601A (en) | 2014-05-16 |
Family
ID=50889380
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020120126211A KR20140059601A (en) | 2012-11-08 | 2012-11-08 | Voice recognition performance improvement using intra frame feature |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR20140059601A (en) |
- 2012-11-08: KR application KR1020120126211A filed, published as KR20140059601A (status: not active, Application Discontinuation)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9818433B2 (en) | Voice activity detector for audio signals | |
US11064296B2 (en) | Voice denoising method and apparatus, server and storage medium | |
EP3602549B1 (en) | Apparatus and method for post-processing an audio signal using a transient location detection | |
KR101330237B1 (en) | Method and apparatus for adjusting channel delay parameters of multi-channel signal | |
US8326610B2 (en) | Producing phonitos based on feature vectors | |
JP6793706B2 (en) | Methods and devices for detecting audio signals | |
US9396739B2 (en) | Method and apparatus for detecting voice signal | |
DE102014100407A1 (en) | Noise reduction devices and noise reduction methods | |
EP2381438A1 (en) | Signal classification processing method, classification processing device and encoding system | |
US11335355B2 (en) | Estimating noise of an audio signal in the log2-domain | |
CN110706693A (en) | Method and device for determining voice endpoint, storage medium and electronic device | |
KR20110043695A (en) | Method and apparatus to facilitate determining signal bounding frequencies | |
KR20150032390A (en) | Speech signal process apparatus and method for enhancing speech intelligibility | |
KR102196390B1 (en) | Method and apparatus for extracting phase difference parameters between channels | |
CN115348507A (en) | Impulse noise suppression method, system, readable storage medium and computer equipment | |
US8935159B2 (en) | Noise removing system in voice communication, apparatus and method thereof | |
KR100571427B1 (en) | Feature Vector Extraction Unit and Inverse Correlation Filtering Method for Speech Recognition in Noisy Environments | |
KR20140059601A (en) | Voice recognition performance improvement using intra frame feature | |
KR101096091B1 (en) | Apparatus for Separating Voice and Method for Separating Voice of Single Channel Using the Same | |
Darabian et al. | Improving the performance of MFCC for Persian robust speech recognition | |
WO2009055718A1 (en) | Producing phonitos based on feature vectors | |
US10129659B2 (en) | Dialog enhancement complemented with frequency transposition | |
KR20080049385A (en) | Pre-processing method and device for clean speech feature estimation based on masking probability | |
KR100744375B1 (en) | Apparatus and method for processing sound signal | |
Gbadamosi et al. | Non-Intrusive Noise Reduction in GSM Voice Signal Using Non-Parametric Modeling Technique. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WITN | Withdrawal due to no request for examination |