KR20140059601A - Voice recognition performance improvement using intra frame feature - Google Patents

Voice recognition performance improvement using intra frame feature

Info

Publication number
KR20140059601A
Authority
KR
South Korea
Prior art keywords
frame
speech signals
speech signal
signal
speech
Prior art date
Application number
KR1020120126211A
Other languages
Korean (ko)
Inventor
이성주
강병옥
정호영
정훈
전형배
이윤근
Original Assignee
한국전자통신연구원 (Electronics and Telecommunications Research Institute, ETRI)
Priority date
Filing date
Publication date
Application filed by 한국전자통신연구원 (Electronics and Telecommunications Research Institute)
Priority to KR1020120126211A priority Critical patent/KR20140059601A/en
Publication of KR20140059601A publication Critical patent/KR20140059601A/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0272 Voice signal separating
    • G10L21/038 Speech enhancement, e.g. noise reduction or echo cancellation using band spreading techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

Disclosed is a method for improving automatic speech recognition performance using intra-frame features. According to the present invention, the method includes: collecting speech signals and pre-processing the collected signals by emphasizing or attenuating them; dividing the pre-processed speech signal into critical bands using a gamma-tone filter bank and channelizing the signal in each critical band; frame-blocking the channelized speech signals with a frame shift of 10 ms and a frame size of 20 to 25 ms; Hamming-windowing each blocked channel and extracting a predefined amount of data from the predefined section; estimating signal intensity from the extracted data on a time-frequency basis and estimating energy from the estimated signal intensity; deriving cepstral coefficients and derivatives through a logarithmic operation and a discrete cosine transform of the estimated energy; performing sub-frame analysis on the pre-processed speech signal and extracting intra-frame features from the sub-frame-analyzed signal; and deriving speech recognition features by combining the cepstral coefficients, the derivatives, and the intra-frame features.

Description

BACKGROUND OF THE INVENTION 1. Field of the Invention [0001] The present invention relates to a method for improving automatic speech recognition performance using intra-frame characteristics, and more particularly, to a technique for improving the performance of automatic speech recognition by combining intra-frame characteristics with speech features obtained through a gamma-tone filter bank.

The conventional digital speech signal framing for automatic speech recognition uses a frame shift of 10 ms and a frame size of 20 to 25 ms.

This is based on the quasi-stationarity assumption that the statistical properties of a speech signal are stationary within a 20-25 ms window.
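As a concrete illustration (not part of the patent text), the 10 ms shift / 20-25 ms frame convention can be sketched in numpy. A 16 kHz sampling rate and a 25 ms frame are assumptions for this sketch:

```python
import numpy as np

def frame_block(signal, sr=16000, frame_ms=25, shift_ms=10):
    """Split a 1-D signal into overlapping frames (25 ms frames, 10 ms shift)."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    shift_len = int(sr * shift_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // shift_len
    return np.stack([signal[i * shift_len : i * shift_len + frame_len]
                     for i in range(n_frames)])

x = np.arange(16000, dtype=float)   # 1 s of dummy samples
frames = frame_block(x)
print(frames.shape)                 # (98, 400)
```

Each successive frame starts 160 samples after the previous one, so adjacent frames overlap by 240 samples.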

While this assumption holds reasonably well for voiced sounds such as vowels, it is known to break down in signal sections whose characteristics change rapidly over time, such as consonants.

The conventional speech feature extraction method for automatic speech recognition assumes that one signal frame is stationary and extracts static feature vectors such as MFCC, PLP, and GTPCC; in addition, velocity and acceleration information is extracted, and these are called the delta and delta-delta features, respectively.

At present, feature extraction for automatic speech recognition combines static features and dynamic features into a single feature vector.

The dynamic features comprise the delta coefficients and delta-delta coefficients, which capture the temporal change of the cepstral coefficients.
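The delta and delta-delta coefficients are conventionally computed by a regression over neighbouring frames. A minimal sketch follows; the regression window width N=2 is a common choice, not something the patent specifies:

```python
import numpy as np

def delta(feat, N=2):
    """Delta (velocity) of a (frames x coeffs) feature matrix, using the
    standard regression over +/-N neighbouring frames with edge padding."""
    padded = np.pad(feat, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return sum(n * (padded[N + n : N + n + len(feat)] -
                    padded[N - n : N - n + len(feat)])
               for n in range(1, N + 1)) / denom

c = np.arange(20, dtype=float).reshape(10, 2)  # toy cepstra, +2 per frame
d = delta(c)       # velocity ("delta")
dd = delta(d)      # acceleration ("delta-delta")
```

For this linear toy input the interior delta values are exactly 2.0 per frame step, and the delta-delta values there are 0, as expected for a constant slope.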

Traditionally, a signal frame size of 20 to 25 ms is used for automatic speech recognition, based on the periodic nature of vowels.

Therefore, although the dynamic characteristics of consonants can change to a meaningful degree even within a single frame, current speech feature extraction methods fail to take this into account.

Korean Patent Laid-Open No. 1997-0028836 discloses a speech recognition apparatus using delta-delta cepstrum coefficients and a control method thereof. However, the disclosed technology is limited in that it does not provide a speech feature extraction technique that considers intra-frame characteristics.

It is an object of the present invention to provide a method for improving the performance of an automatic speech recognition system by combining intraframe characteristics with existing speech characteristics.

According to another aspect of the present invention, there is provided a method for improving automatic speech recognition performance using intra-frame characteristics, comprising the steps of: collecting a speech signal and pre-processing the collected speech signal by emphasizing or attenuating it; dividing the pre-processed speech signal into critical bands through a gamma-tone filter bank and channelizing each critical-band signal; performing frame blocking on the channelized speech signal with a frame shift of 10 ms and a frame size of 20 to 25 ms; performing Hamming windowing on each of the blocked channels to extract a predetermined amount of data within a predetermined interval; estimating the signal strength of the extracted data on a time-frequency basis and estimating energy through the estimated signal strength; deriving cepstral coefficients and derivatives through a logarithmic operation and a discrete cosine transform (DCT) of the estimated energy; performing sub-frame analysis on the pre-processed speech signal and extracting intra-frame characteristics from the sub-frame-analyzed signal; and deriving speech recognition features by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.

According to the embodiment of the present invention, when constructing a feature vector for speech recognition, speech recognition performance can be improved by adding intra-frame characteristics that represent the variation of the statistical properties of the speech signal within a signal frame.

FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.

The present invention will now be described in detail with reference to the accompanying drawings. In the following description, repeated explanations and detailed descriptions of known functions and configurations that might obscure the gist of the present invention are omitted. The embodiments of the present invention are provided to describe the invention more fully to those skilled in the art; accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a method of extracting speech features using intra-frame characteristics according to an embodiment of the present invention.

Referring to FIG. 1, a method of extracting features of speech using Intra-frame characteristics according to an embodiment of the present invention includes first collecting a speech signal of a human and pre-processing it (S10).

In order to analyze a human speech signal, the collected speech signal can be emphasized or attenuated to a predetermined level or higher.
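A common way to emphasize the collected signal is a first-order pre-emphasis filter that boosts high frequencies. The patent does not specify the filter, so the form and coefficient below are illustrative assumptions:

```python
import numpy as np

def pre_emphasize(x, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].
    alpha = 0.97 is a conventional choice, not taken from the patent."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

y = pre_emphasize(np.array([1.0, 1.0, 1.0]))
```

A constant (DC) input is strongly attenuated after the first sample, which is exactly the intended high-pass behaviour.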

Thereafter, the speech signal pre-processed in step S10 is divided into critical bands through the gamma-tone filter bank, and each critical-band signal is channelized (S20).

At this time, the channelized speech signal through the gamma-tone filter bank may be provided to the sub-frame analysis step.
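As an illustration of the critical-band split, a simple FIR gammatone filter bank can be sketched as follows. The centre frequencies, 4th-order filter, and ERB bandwidth formula are conventional choices for gammatone analysis, not values taken from the patent:

```python
import numpy as np

def gammatone_ir(fc, sr=16000, order=4, dur=0.025):
    """Impulse response of a gammatone filter centred at fc (Hz):
    t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t)."""
    t = np.arange(int(sr * dur)) / sr
    erb = 24.7 * (4.37 * fc / 1000 + 1)   # ERB bandwidth (Glasberg & Moore)
    b = 1.019 * erb
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def gammatone_channels(x, centres, sr=16000):
    """Split x into one channel per critical-band centre frequency."""
    return np.stack([np.convolve(x, gammatone_ir(fc, sr), mode='same')
                     for fc in centres])

x = np.random.randn(1600)
chans = gammatone_channels(x, [200, 500, 1000, 2000])
print(chans.shape)   # (4, 1600)
```

Each row of `chans` is the input filtered to one critical band; these channelized signals feed both the frame-blocking path and the sub-frame analysis path described above.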

In step S30, each channel separated in step S20 is subjected to frame blocking with a frame shift of 10 ms and a frame size of 20 to 25 ms.

At this time, if a spectrum-analysis-based speech feature extraction method such as Mel-frequency cepstral coefficients (MFCC) is applied, a separate spectral analysis process is required in addition to the sub-frame analysis process in step S30.

Thereafter, Hamming windowing is performed on each of the blocked channels to extract a predetermined amount of data within a predetermined interval (S40).
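The windowing step simply tapers the edges of each blocked frame before further analysis. A minimal sketch, with a 400-sample frame length assumed for illustration:

```python
import numpy as np

# Apply a Hamming window to every frame at once (broadcasting over rows).
frames = np.ones((98, 400))              # stand-in for the blocked frames
windowed = frames * np.hamming(400)
print(windowed[0, 0], windowed[0, 200])  # edge ~0.08, centre ~1.0
```

The window leaves the frame centre nearly untouched while attenuating the edges, which reduces spectral leakage in the subsequent analysis.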

Thereafter, the signal intensity is estimated on the time-frequency basis from the data extracted in step S40, and the energy is estimated through the estimated signal intensity (S50).

Then, the estimated energy is subjected to a logarithmic operation and a DCT (Discrete Cosine Transform) to derive cepstral coefficients and derivatives (S60).
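The log-and-DCT step that turns estimated channel energies into cepstral coefficients can be illustrated with an explicit type-II DCT. The energies below are toy values; in practice the input would be the per-channel energies estimated in S50:

```python
import numpy as np

def dct_ii(x):
    """Unnormalized type-II DCT of a 1-D array (the transform used for cepstra)."""
    N = len(x)
    k = np.arange(N)[:, None]
    n = np.arange(N)[None, :]
    return (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) * x).sum(axis=1)

energies = np.array([1.0, 2.0, 4.0, 8.0])   # toy channel energies
cepstra = dct_ii(np.log(energies))[:3]      # keep the low-order coefficients
```

The zeroth coefficient is the sum of the log-energies (here 6 ln 2), i.e. a log-energy-like term, while higher coefficients describe the spectral envelope shape.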

Sub-frame analysis is performed on the speech signal pre-processed in step S10 (S70), and intra-frame characteristics are extracted from the sub-frame-analyzed speech signal (S80).

At this time, the intra-frame characteristics include the sub-frame log-energy difference within a frame, the sub-frame cepstral coefficient difference within a frame, the sub-frame resonance ratio difference within a frame, the sub-frame zero-crossing rate (ZCR) difference within a frame, the sub-frame tonality difference within a frame, and the sub-frame entropy difference within a frame.
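Two of the listed quantities, the sub-frame log-energy difference and the sub-frame ZCR difference, can be sketched like this. Splitting the frame into two sub-frames is an assumption for illustration; the patent does not fix the sub-frame count:

```python
import numpy as np

def intra_frame_features(frame, n_sub=2):
    """Differences of sub-frame statistics *within* one frame:
    log-energy difference and zero-crossing-rate difference."""
    subs = np.array_split(frame, n_sub)
    log_e = [np.log(np.sum(s ** 2) + 1e-12) for s in subs]   # epsilon avoids log(0)
    zcr = [np.mean(np.abs(np.diff(np.sign(s))) > 0) for s in subs]
    return log_e[-1] - log_e[0], zcr[-1] - zcr[0]

# A frame whose character changes mid-frame: constant, then alternating.
frame = np.concatenate([np.ones(200), np.tile([1.0, -1.0], 100)])
de, dz = intra_frame_features(frame)
```

For this frame the two sub-frames have identical energy (`de` is 0) but very different zero-crossing rates (`dz` is 1), which is exactly the kind of within-frame variation the static features cannot capture.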

Thereafter, speech recognition features can be derived by combining the cepstral coefficients and derivatives obtained in step S60 with the intra-frame characteristics extracted in step S80 (S90).
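The combination step amounts to concatenating the static, dynamic, and intra-frame features into one vector per frame. The dimensions below are illustrative placeholders, not values specified by the patent:

```python
import numpy as np

n_frames = 98
cepstra = np.random.randn(n_frames, 13)   # static cepstral coefficients
deltas = np.random.randn(n_frames, 13)    # velocity (delta)
ddeltas = np.random.randn(n_frames, 13)   # acceleration (delta-delta)
intra = np.random.randn(n_frames, 6)      # intra-frame differences

# One combined feature vector per frame.
feat = np.hstack([cepstra, deltas, ddeltas, intra])
print(feat.shape)   # (98, 45)
```

The recognizer then consumes this concatenated vector, so the intra-frame terms extend rather than replace the conventional static-plus-dynamic representation.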

While the present invention has been described in detail with reference to preferred embodiments, those skilled in the art will appreciate that various modifications, additions, and substitutions are possible without departing from the scope and spirit of the invention as disclosed in the accompanying claims. For example, the invention may be embodied in the form of a recording medium on which a program implementing the method of the present invention is recorded. The above-described embodiments are therefore to be understood as illustrative in all respects and not restrictive. The scope of the present invention is defined by the appended claims rather than by the detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

S10: preprocessing step S20: gamma tone filtering step
S30: Frame blocking step S40: Hamming windowing step
S50: Power and energy estimation step
S60: Post-processing step
S70: Sub-frame analysis step
S80: Intra frame feature extraction step
S90: Voice recognition characteristic combination step

Claims (1)

Collecting a speech signal, and pre-processing the collected speech signal by emphasizing or attenuating it;
Dividing the pre-processed speech signal into critical bands through a gamma-tone filter bank and channelizing each critical-band signal;
Performing frame blocking on the channelized speech signal with a frame shift size of 10 ms and a frame size of 20 to 25 ms;
Performing Hamming windowing on each of the blocked channels to extract a predetermined amount of data within a predetermined interval;
Estimating the signal strength of the extracted data on a time-frequency basis, and estimating energy through the estimated signal strength;
Deriving cepstral coefficients and derivatives through a logarithmic operation and a discrete cosine transform (DCT) on the estimated energy;
Performing a sub-frame analysis on the pre-processed speech signal and extracting intra-frame characteristics from the sub-frame-analyzed speech signal; and
Deriving a speech recognition characteristic by combining the cepstral coefficients, the derivatives, and the intra-frame characteristics.
KR1020120126211A 2012-11-08 2012-11-08 Voice recognition performance improvement using intra frame feature KR20140059601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020120126211A KR20140059601A (en) 2012-11-08 2012-11-08 Voice recognition performance improvement using intra frame feature

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020120126211A KR20140059601A (en) 2012-11-08 2012-11-08 Voice recognition performance improvement using intra frame feature

Publications (1)

Publication Number Publication Date
KR20140059601A true KR20140059601A (en) 2014-05-16

Family

ID=50889380

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020120126211A KR20140059601A (en) 2012-11-08 2012-11-08 Voice recognition performance improvement using intra frame feature

Country Status (1)

Country Link
KR (1) KR20140059601A (en)


Legal Events

Date Code Title Description
WITN Withdrawal due to no request for examination