KR20040107173A

KR20040107173A - An Amplitude Warping Approach to Intra-Speaker Normalization for Speech Recognition

Info

Publication number: KR20040107173A
Application number: KR1020030038102A
Authority: KR
Inventors: 홍광석
Original assignee: 홍광석
Priority date: 2003-06-13
Filing date: 2003-06-13
Publication date: 2004-12-20
Also published as: KR100511248B1; AU2003244240A1; WO2004111999A1

Abstract

PURPOSE: An amplitude warping method for intra-speaker normalization in speech recognition is provided that achieves amplitude normalization for pitch-altered utterances and compensates for amplitude differences by estimating an intra-speaker warping factor, thereby improving the recognition rate and reliability of speech recognition applications. CONSTITUTION: The inverse ratio between the input pitch and the reference pitch of a speaker is calculated to determine an amplitude warping factor. The amplitude warping factor is used to adjust the height of the triangular filters, which adjusts the amplitude and at the same time determines the slope of the amplitude along the frequency axis. After the slope of the amplitude is determined, the resulting feature vectors are decoded with a hidden Markov model (HMM).

Description

An Amplitude Warping Approach to Intra-Speaker Normalization for Speech Recognition

The present invention relates to an amplitude warping method for intra-speaker normalization in speech recognition. More specifically, it achieves amplitude normalization for pitch-altered utterances by estimating a new intra-speaker warping factor for speaker normalization. Applied in a speech recognition product, either by selecting a recognition model suited to the user's voice or as a speaker adaptation technique in the recognizer, the method improves the recognition rate and the reliability of the product.

The glottis of a speaker's vocal folds is generally known to control the pitch of the voice, whereas the vocal tract is known to determine vowels through formants and to articulate consonants. The pitch and formant components of an utterance are nearly independent of each other in the speech signal.

To reduce the degradation in speech recognition performance caused by variation in vocal tract shape between speakers, frequency warping techniques have been studied for speaker normalization. That is, techniques have been studied for normalizing parametric representations of the speech signal in order to reduce the effects of inter-speaker differences.

Accordingly, normalization has been performed with linear and nonlinear frequency warping functions to compensate for formant position variation between speakers. These procedures attempt to match each speaker's actual vocal tract shape and, in compensating for these differences, to address the difficult problem of formant position estimation.

When Gaussian mixtures are used as output distributions in a hidden Markov model (HMM), one of the most important problems is that the various speaker-dependent scale factors tend to be modeled by the multimodality of the mixture distributions.

Inter-speaker factors also play an important role in speech recognition, and speaker adaptation that requires inter-speaker normalization is based on vocal tract normalization (VTN). VTN attempts to reduce variation in speech between speakers by compensating for variation in pitch-altered utterances that depend on emotional state.

The existing vocal tract normalization method is a very good way to improve the accuracy of inter-speaker normalization, and it is based on normalizing the frequency axis.

Frequency axis normalization for inter-speaker normalization is described as follows.

The central idea of VTN is to normalize the frequency axis of each speaker's acoustic vectors in order to remove speaker-dependent variability from the acoustic vectors during recognition. For a given sound, the positions of the formant peaks in the spectrum are inversely proportional to the length of the vocal tract.

The vocal tract length varies from approximately 13 cm to 18 cm, so formant center frequencies differ from person to person and can vary by about 25% between speakers.

Such variation is a major cause of the performance degradation in going from speaker-dependent to speaker-independent speech recognition.

The optimal warping factor is obtained by searching 13 evenly spaced values in the range 0.88 ≤ α ≤ 1.12. For example, stepping α in increments of 0.02 over this range gives the 13 values 0.88, 0.90, 0.92, ..., 1.12.
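The grid of candidate values described above can be sketched in a few lines of Python. This is only an illustration: the function names and the `score` callable (standing in for whatever likelihood the recognizer assigns to an utterance warped by a given α) are hypothetical and do not come from the patent.

```python
def warping_grid(lo=0.88, hi=1.12, step=0.02):
    """Return the evenly spaced candidate warping factors in [lo, hi]."""
    n = int(round((hi - lo) / step)) + 1
    # Round to two decimals so the grid reads 0.88, 0.90, ..., 1.12.
    return [round(lo + i * step, 2) for i in range(n)]


def pick_best_alpha(score, grid):
    """Grid search: return the candidate with the highest score.

    `score` is a hypothetical callable mapping alpha to a goodness value,
    for example a log-likelihood from the recognizer.
    """
    return max(grid, key=score)


grid = warping_grid()
print(len(grid), grid[0], grid[-1])
```

With the default arguments the grid has 13 entries, matching the count stated in the text.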

The range of α is chosen to reflect the roughly 25% variation in vocal tract length found in adults.

Several methods have been proposed for determining the optimal amount of frequency warping during recognition.

In speech recognition, a sequence of acoustic vectors X is observed over time t = 1, ..., T.

For each hypothesized word sequence W, a model distribution p(X | W; θ) with appropriate reference model parameters θ is assumed. In speaker-adaptive acoustic modeling, the distribution is assumed to depend on a speaker-specific parameter α as follows.

p(X | W; θ, α)

Typically, two kinds of transformation are distinguished.

First, a transformation of the model parameters θ: for each value of the inter-speaker parameter α, the unnormalized model parameters θ are transformed into normalized parameters θ^α.

θ → θ^α

The distribution then becomes p(X | W; θ, α) = p(X | W; θ^α).

Second, a transformation of the observation vectors X, which can be formulated as a mapping of the acoustic vectors.

X → X^α

The distribution then becomes p(X | W; θ, α) = p(X^α | W; θ).

Thus, the existing vocal tract normalization (VTN) method normalizes the frequency axis to compensate for the changes along that axis in the spectral envelope of speech, which arise because vocal tract length differs from person to person. Its drawback is that it compensates for frequency differences but not for amplitude differences.

Accordingly, given that the existing vocal tract normalization method compensates only for frequency differences, the object of the present invention is to provide an amplitude warping method for intra-speaker normalization in speech recognition that achieves amplitude normalization for pitch-altered utterances, and thereby also compensates for amplitude differences, by estimating a new intra-speaker warping factor for speaker normalization. Applied in a speech recognition product, either by selecting a recognition model suited to the user's voice or as a speaker adaptation technique in the recognizer, the method improves the recognition rate and the reliability of the product.

FIGS. 1 and 2 show linear prediction coefficient (LPC) spectral envelopes of voiced sounds uttered by a male and a female speaker with varying pitch.

FIG. 3 shows speech spectra of the male vowel /a/ uttered normally (bold line) and with lowered pitch (dotted line).

FIG. 4 shows the intra-speaker characteristic parameter β for pitch-altered utterances.

FIGS. 5A and 5B show mixture-based estimation of the optimal frequency warping factor.

FIG. 6 shows mel filter bank analysis with frequency warping.

FIG. 7 shows a mel filter bank with amplitude warping for intra-speaker normalization.

FIG. 8 shows a method of applying α and β in sequence.

To achieve the above object, the present invention provides the following.

The feature-space distributions of unmodified speech from a speaker's pitch-altered utterances vary owing to acoustic differences in the speech produced by the glottis and the vocal tract. The invention therefore provides an amplitude warping method for intra-speaker normalization in speech recognition, characterized in that an amplitude warping factor is determined by calculating the inverse ratio between the input pitch and the reference pitch of the input speaker, and in that the amplitude warping factor is used to adjust the height of the triangular filters, thereby adjusting the amplitude and at the same time determining the slope of the amplitude along the entire frequency axis.

In a preferred embodiment, the inverse ratio between the input pitch and the reference pitch of the input speaker is calculated as follows. Let p1 be the pitch of a normal (average) utterance (the reference pitch) and p2 the pitch when the same sound is uttered again. Because a higher p2 generally means a higher amplitude, p1/p2 is adopted as the new slope adjustment constant so that the amplitude is normalized to roughly the amplitude at p1.

In a more preferred embodiment, after the slope of the amplitude is determined, the resulting feature vectors are HMM-decoded.

Hereinafter, preferred embodiments of the present invention are described with reference to the accompanying drawings.

First, the processing used to carry out the amplitude warping approach to intra-speaker normalization is described in detail. This processing attempts to reduce intra-speaker speech variation by compensating for variation in pitch-altered utterances caused by emotion.

Because the distortions caused by pitch-altered utterances can be modeled by a simple linear warping in the frequency domain of the speech signal, the normalization procedure rescales the amplitude axis by an appropriately estimated warping factor.

Prosody is generally known to express the acoustic characteristics of emotion. Feature parameters are therefore analyzed from the voiced regions of the speech waveform data. This is an important point for the intra-speaker factor.

FIGS. 1 and 2 show linear prediction coefficient (LPC) spectral envelopes of voiced sounds uttered by a male and a female speaker with varying pitch.

Specifically, FIG. 1 shows LPC spectral envelopes of a voiced sound uttered by a male speaker (the vowel /a/ with pitch ranging from 113 to 251 Hz), and FIG. 2 shows LPC spectral envelopes of a female voiced sound (the vowel /a/ with local pitch ranging from 194 to 342 Hz). The reason for the energy gain at higher harmonics can be seen by comparing the glottal airflow waveforms: as vocal intensity increases, the closing rate of the glottis increases. In general, male voices tend to have a lower fundamental frequency and stronger harmonics than female voices.

FIG. 3 shows speech spectra of the male vowel /a/ uttered normally (bold line) and with lowered pitch (dotted line). The feature-space distributions of unmodified speech from a speaker's pitch-altered utterances vary because of acoustic differences in the speech produced by the glottis and the vocal tract. According to the present invention, a warping factor can therefore be derived from the inverse ratio of the input pitch to the reference pitch.

More specifically, the warping factor is the inverse ratio between the input speaker's pitch and the reference pitch.

Accordingly, the amplitude warping factor is used to adjust the height of the triangular filters, which adjusts the amplitude and at the same time readily determines the slope of the amplitude along the entire frequency axis.

As shown in FIGS. 6, 7, and 8, when features are extracted from speech with the MFCC method, triangular filters are applied along the frequency axis. That is, triangular filters are used when the frequency axis is divided into bands of a given bandwidth for processing.
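A triangular mel filter bank whose filter height is adjustable can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the mel formula is the common 2595·log10(1 + f/700) form, the filter count, FFT size, and 16 kHz sampling rate in the usage lines are placeholders that the text does not specify, and the single `height` argument scales every filter uniformly because the text does not say how the height should vary across frequency.

```python
import math


def hz_to_mel(f_hz):
    # Common mel-scale formula (an assumption; the patent does not give one).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def triangular_filterbank(n_filters, n_fft, sample_rate, height=1.0):
    """Return n_filters triangular filters over the n_fft // 2 + 1 FFT bins.

    Band edges are equally spaced on the mel scale; `height` is the peak
    value of each triangle (1.0 leaves the amplitude unchanged).
    """
    top = hz_to_mel(sample_rate / 2.0)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    freqs = [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]
    bank = []
    for i in range(n_filters):
        left, center, right = edges[i], edges[i + 1], edges[i + 2]
        row = []
        for f in freqs:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            # Triangle: zero outside [left, right], peak `height` at center.
            row.append(height * max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


# Placeholder sizes: 24 filters, 512-point FFT, 16 kHz audio.
bank = triangular_filterbank(24, 512, 16000, height=0.8)
print(len(bank), len(bank[0]))
```

Because each filter output is a weighted sum of spectral magnitudes, lowering the triangle height by some factor lowers the corresponding filter bank output by the same factor.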

Let p1 be the pitch of a normal (average) utterance (the reference pitch) and p2 the pitch when the same sound is uttered again. Because a higher p2 generally means a higher amplitude, p1/p2 is adopted as the new slope adjustment constant so that the amplitude is normalized to roughly the amplitude at p1.
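The inverse pitch ratio can be written out directly. In the sketch below the function names are hypothetical, and applying the factor to a list of filter bank outputs is an assumption about where the scaling takes place; it is equivalent to scaling the triangular filter heights because the filters are linear.

```python
def amplitude_warping_factor(reference_pitch_hz, input_pitch_hz):
    """Return p1 / p2, where p1 is the reference pitch and p2 the input pitch."""
    if reference_pitch_hz <= 0 or input_pitch_hz <= 0:
        raise ValueError("pitch values must be positive")
    return reference_pitch_hz / input_pitch_hz


def scale_energies(filter_outputs, factor):
    """Scale filter bank outputs by the amplitude warping factor."""
    return [factor * e for e in filter_outputs]


# Illustrative pitches only: a 120 Hz reference and a 150 Hz input utterance.
factor = amplitude_warping_factor(120.0, 150.0)
print(factor)
```

An input pitch above the reference gives a factor below 1, which attenuates the amplitude toward its level at the reference pitch; an input pitch below the reference gives a factor above 1.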

The resulting feature vectors are then used for HMM decoding; the goal is to warp the amplitude scale of each test utterance so that it matches the normalized HMM model.

FIG. 4 shows the intra-speaker characteristic parameter β for pitch-altered utterances. The intra-speaker scale factor is closely related to the spectral energy. In FIG. 4, pitch and energy are used to estimate the β that satisfies [equation not reproduced].

In normalizing the acoustic model within a speaker, the distribution is assumed to depend on an intra-speaker characteristic parameter β as follows.

p(X | W; θ, α, β)

Typically, two kinds of transformation are distinguished.

First, a transformation of the model parameters θ: for a value of the intra-speaker parameter β, the unnormalized model parameters θ are transformed into normalized model parameters θ^β.

θ → θ^{α,β} (or θ → θ^β)

The distribution is as follows.

p(X | W; θ, α, β) = p(X | W; θ^{α,β})

or

p(X | W; θ, α, β) = p(X^α | W; θ^β)

Second, a transformation of the observation vectors X, which can be formulated as a mapping of the acoustic vectors as follows.

X → X^{α,β} (or X → X^β)

The distribution is as follows.

p(X | W; θ, α, β) = p(X^{α,β} | W; θ)

or

p(X | W; θ, α, β) = p(X^β | W; θ^α)

The intra-speaker scale factor β for the amplitude axis is used to rescale the amplitude axis before the acoustic vectors for speech recognition are computed.

The amplitude warping method for intra-speaker normalization in speech recognition according to the present invention was tested using the SKKU-1 speech database (SKKU: SungKyunKwan University).

The vocabulary of the SKKU-1 speech database consists of Korean digits, personal names, phonetically balanced words (PBW), and phonetically rich words (PRW).

The speech signal was pre-emphasized with the filter 1 - 0.95z^-1 and analyzed every 10 ms using a 20 ms Hamming window. A 39-dimensional feature vector was extracted from each frame.
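The front-end settings above (pre-emphasis 1 - 0.95z^-1, 20 ms Hamming window, 10 ms shift) can be sketched as follows. The function names are hypothetical, and the 16 kHz sampling rate in the usage lines is an assumption, since the text does not state the sampling rate of the SKKU-1 recordings.

```python
import math


def pre_emphasize(signal, coeff=0.95):
    """Apply y[n] = x[n] - coeff * x[n-1]; the first sample is kept as is."""
    if not signal:
        return []
    return [signal[0]] + [
        signal[n] - coeff * signal[n - 1] for n in range(1, len(signal))
    ]


def hamming(length):
    """Standard Hamming window of the given length (length >= 2)."""
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * k / (length - 1))
        for k in range(length)
    ]


def frame_signal(signal, sample_rate, win_ms=20, hop_ms=10):
    """Cut the signal into Hamming-windowed frames of win_ms every hop_ms."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = hamming(win)
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frames.append([signal[start + k] * window[k] for k in range(win)])
    return frames


# One second of a constant signal at an assumed 16 kHz sampling rate.
samples = pre_emphasize([1.0] * 16000)
frames = frame_signal(samples, 16000)
print(len(frames), len(frames[0]))
```

At 16 kHz a 20 ms window is 320 samples and a 10 ms shift is 160 samples, so consecutive frames overlap by half.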

The features are a 12th-order MFCC (mel-frequency cepstral coefficient) vector, a 12th-order delta-MFCC vector, a 12th-order delta-delta-MFCC vector, log energy, delta log energy, and delta-delta log energy.
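The 39 dimensions break down as 12 + 12 + 12 cepstral terms plus 3 energy terms. The sketch below assembles such a vector from precomputed static features. It rests on assumptions the text does not settle: the delta coefficients use the common regression formula over ±2 frames with edge frames repeated, and the output is ordered static, delta, delta-delta. The function names are hypothetical.

```python
def deltas(frames, width=2):
    """Regression-based time derivatives of a sequence of feature vectors."""
    n = len(frames)
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = []
    for t in range(n):
        row = []
        for d in range(len(frames[t])):
            num = 0.0
            for k in range(1, width + 1):
                nxt = frames[min(t + k, n - 1)][d]  # repeat the last frame
                prv = frames[max(t - k, 0)][d]      # repeat the first frame
                num += k * (nxt - prv)
            row.append(num / denom)
        out.append(row)
    return out


def stack_features(mfcc12, log_energy):
    """Build 39-dim vectors: 13 static + 13 delta + 13 delta-delta values.

    mfcc12: list of frames, each with 12 cepstral coefficients.
    log_energy: list with one log-energy value per frame.
    """
    static = [list(c) + [e] for c, e in zip(mfcc12, log_energy)]
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Dummy input: 5 frames whose values grow by 1 per frame.
mfcc = [[float(t)] * 12 for t in range(5)]
energy = [float(t) for t in range(5)]
features = stack_features(mfcc, energy)
print(len(features), len(features[0]))
```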

Here, the PBW set consists of words selected to be phonetically balanced, while the PRW set contains more words than the PBW set, numbering in the thousands, with no fixed rule on the word count. The Hamming window is a representative window function applied to each analysis frame in speech analysis.

FIG. 5A shows mixture-based estimation of the optimal frequency warping factor. FIG. 5B shows estimation of the optimal frequency warping factor based on frequency warping of the input speech. The speech is warped using the estimated warping factor, and the resulting feature vectors are used for HMM decoding.

FIG. 6 shows mel filter bank analysis with warping. For amplitude normalization, pitch and energy are first extracted from the utterance, and the intra-speaker parameter is then determined.

FIG. 7 shows a mel filter bank with amplitude warping for intra-speaker normalization.

More specifically, FIG. 6 shows the analysis performed once the factor α has been determined, FIG. 7 shows the analysis performed once β has been determined, and FIG. 8 shows how α and β are applied in sequence.

Table 1 shows the word error rates for digit and word recognition on the SKKU-1 database in three cases: the baseline recognizer, the baseline recognizer with inter-speaker normalization, and the baseline recognizer with both inter-speaker and intra-speaker normalization.

In Table 1, "baseline" is the recognition error rate with no normalization, "with α" is the error rate with only the conventional frequency normalization, and "with α and β" is the error rate with both the conventional frequency normalization and the proposed amplitude normalization. The error rate decreases progressively across the three cases.

According to the recognition results, the word recognition rates were 96.4% and 98.2%, and the error rate for digit and word recognition was reduced to 0.4 to 2.3%.

As described above, the amplitude warping method for intra-speaker normalization in speech recognition according to the present invention achieves amplitude normalization for pitch-altered utterances by estimating a new intra-speaker warping factor for speaker normalization.

Accordingly, in a speech recognition product, applying a recognition model suited to the user's voice, or applying the method as a speaker adaptation technique in the recognizer, improves the recognition rate and the reliability of the product.

Claims

An amplitude warping method for intra-speaker normalization in speech recognition, wherein, because the feature-space distributions of unmodified speech from a speaker's pitch-altered utterances vary owing to acoustic differences in the speech produced by the glottis and the vocal tract, an amplitude warping factor is determined by calculating the inverse ratio between the input pitch and the reference pitch of the input speaker;

and wherein the amplitude warping factor is used to adjust the height of the triangular filters, thereby adjusting the amplitude and at the same time determining the slope of the amplitude along the entire frequency axis.

The method of claim 1, wherein the inverse ratio between the input pitch and the reference pitch of the input speaker is calculated by letting p1 be the pitch of a normal (average) utterance (the reference pitch) and p2 the pitch when the same sound is uttered again, and, because a higher p2 generally means a higher amplitude, adopting p1/p2 as the new slope adjustment constant so that the amplitude is normalized to roughly the amplitude at p1.

The method of claim 1 or 2, wherein after the slope of the amplitude is determined, the feature vectors are HMM decoded.