KR100322693B1 - Voice recognition method using linear prediction analysis synthesis - Google Patents
Voice recognition method using linear prediction analysis synthesis
- Publication number
- KR100322693B1 (applications KR1019950018111A and KR19950018111A)
- Authority
- KR
- South Korea
- Prior art keywords
- recognition
- speech
- voice
- speaker
- input
- Prior art date
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/065—Adaptation
- G10L15/07—Adaptation to the speaker
- G10L15/075—Adaptation to the speaker supervised, i.e. under machine guidance
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L19/00—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
- G10L19/04—Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
- G10L19/08—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
- G10L19/12—Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters the excitation function being a code excitation, e.g. in code excited linear prediction [CELP] vocoders
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Computational Linguistics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
- Telephonic Communication Services (AREA)
Abstract
Description
The present invention relates to an improved speech recognition method that maps features extracted from an input speech signal onto prepared standard-speaker characteristics and performs recognition on the speech synthesized from the mapped features.
As shown in FIG. 1, the conventional speech recognition method performs recognition directly on features extracted from the input speech signal. This approach requires creating or training reference patterns against which recognition is performed, but because the speaker's manner of utterance at recognition time never exactly matches that of training, the recognition rate degrades.
In addition, for speaker-independent recognition, sample features are often extracted from many speakers and grouped appropriately, or converted into features close to the reference pattern through speaker-adaptation techniques. Such speaker adaptation is a complex algorithm that maps in two dimensions, and its performance is limited, so obtaining the expected benefit is very difficult.
The reason is that the algorithm must make a set of features preserving one speaker's characteristics also cover features with different characteristics; to obtain the maximum effect, the reference features must be built by extracting features from the speech data of a great many speakers and compiling statistics over them.
Moreover, the presence of so many speaker characteristics means that the distribution of each feature grows ever wider, blurring the boundaries that discriminate one sound from another.
Consequently, improving the recognition rate is very difficult, and in the speaker-independent case the recognition rate falls below that of the speaker-dependent case.
It is an object of the present invention to provide a speech recognition method that removes utterance-to-utterance variation within a speaker and characteristic deviation between speakers, thereby raising the speaker-independent recognition rate above even the speaker-dependent rate.
A speech recognition method according to the present invention that achieves the above object comprises:
removing speaker-dependent characteristics from the input speech signal and mapping them onto the characteristics of a standard speaker;
generating a synthesized sound based on the mapped characteristics and extracting recognition features from the generated synthesized sound; and
recognizing the speech by comparing the recognition features extracted from the synthesized sound with the recognition features of standard patterns. The present invention is described in detail below with reference to the accompanying drawings.
FIG. 2 shows the speech recognition method according to the present invention, which is carried out through the following steps.
1) Extract the features of the input speech by linear prediction analysis.
Here, linear prediction analysis is a technique that decomposes a speech signal into a vocal tract model and the excitation signal that drives it; representative methods include LPC (Linear Predictive Coding), PARCOR (Partial Correlation), and LSP (Line Spectrum Pair). The present invention uses LSP analysis, whose features are distributed sequentially across the frequency band so that feature mapping is comparatively easy; the features extracted by the analysis are the pitch, the amplitude, and tenth-order vocal tract coefficients.
Here, the pitch represents the intonation of the speech, the amplitude represents its loudness, and the vocal tract coefficients characterize the speaker's vocal tract, carrying the frequency information at which resonances occur.
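The analysis step can be sketched as follows. This is a minimal, illustrative take on per-frame feature extraction: autocorrelation-based pitch estimation, RMS amplitude, and 10th-order linear prediction coefficients via the Levinson-Durbin recursion. The patent uses LSP coefficients; the LPC-to-LSP conversion is omitted here, and the sample rate and pitch search range are assumptions rather than values from the patent.

```python
import math

def lpc_features(frame, order=10, sr=8000):
    """Return (pitch_hz, amplitude, lpc_coefficients) for one frame."""
    n = len(frame)
    # RMS energy of the frame serves as the amplitude feature.
    amplitude = math.sqrt(sum(s * s for s in frame) / n)
    # Autocorrelation up to the LPC order.
    r = [sum(frame[i] * frame[i + k] for i in range(n - k))
         for k in range(order + 1)]
    # Levinson-Durbin recursion for the prediction coefficients.
    a = [1.0] + [0.0] * order
    err = max(r[0], 1e-12)
    for m in range(1, order + 1):
        acc = sum(a[j] * r[m - j] for j in range(m))
        k = -acc / err
        a = [a[j] + k * a[m - j] for j in range(m + 1)] + a[m + 1:]
        err = max(err * (1.0 - k * k), 1e-12)  # guard near-deterministic input
    # Crude pitch estimate: autocorrelation peak in a 50-400 Hz lag range.
    lo, hi = sr // 400, min(sr // 50, n - 1)
    ac = [sum(frame[i] * frame[i + k] for i in range(n - k))
          for k in range(hi + 1)]
    lag = max(range(lo, hi + 1), key=lambda k: ac[k])
    return sr / lag, amplitude, a[1:]
```

For a 400-sample frame of a 100 Hz sine at 8 kHz, this yields a pitch estimate near 100 Hz and an amplitude near 0.707.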
2) Map the extracted feature values onto the feature values of a standard speaker.
The extracted pitch value is mapped onto the pitch value of a designated representative speaker.
Here, when the speakers differ greatly, the distortion introduced by the mapping grows large. The speakers are therefore classified into three groups (a representative male, a representative female, and a representative child), and three basic feature patterns are prepared so that no mapping is performed across the male/female/child boundaries. The basic feature patterns are constructed as follows.
The three representative pitch values, for men, women, and children, are predetermined by analyzing the speech signal of one person representing each group and taking its average.
When the average pitch of the input signal, taken over the whole utterance, is replaced by the representative pitch value closest to it, the result is a flat speech signal from which intonation has been removed.
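The pitch-mapping rule just described can be sketched as follows: the utterance's mean pitch selects the nearest representative group, and every frame's pitch is replaced by that group's value. The representative pitch values below are illustrative assumptions; the patent predetermines them from one analyzed speaker per group.

```python
# Representative pitch values (Hz) for the three groups.  These numbers
# are illustrative assumptions, not values taken from the patent.
REP_PITCH = {"male": 120.0, "female": 220.0, "child": 300.0}

def map_pitch(frame_pitches):
    """Replace every frame pitch with the representative value whose
    group is closest to the utterance's mean pitch, flattening the
    intonation as described in step 2."""
    mean_pitch = sum(frame_pitches) / len(frame_pitches)
    group = min(REP_PITCH, key=lambda g: abs(REP_PITCH[g] - mean_pitch))
    return group, [REP_PITCH[group]] * len(frame_pitches)
```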
Likewise, the extracted vocal tract coefficients are mapped onto those of a representative speaker.
The representative vocal tract coefficients also come in three sets, predetermined by analyzing the speech signal of one person each representing men, women, and children and recording the distribution range of each.
Among the male, female, and child sets, the reference vocal tract coefficients whose per-order distribution ranges are closest to those of the input signal are selected, and the input coefficients are then adjusted so that the distribution range at each order matches the corresponding per-order range of the selected representative features.
The coefficients are adjusted by shifting, expanding, or contracting their range. This feature mapping amounts to removing from the input signal the intonation component due to the manner of utterance and converting its vocal tract characteristics into the designated representative ones.
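The shift/expand/contract adjustment can be read as a linear range mapping applied per coefficient order, sketched below. Treating it as a single affine transform of the observed range onto the representative range is an assumption about how the patent's adjustment is realized.

```python
def adjust_coeff_track(values, src_range, dst_range):
    """Map one order's coefficient track from its observed range onto a
    representative range via a shift plus scaling (expansion or
    contraction).  src_range and dst_range are (min, max) pairs."""
    s_lo, s_hi = src_range
    d_lo, d_hi = dst_range
    scale = (d_hi - d_lo) / (s_hi - s_lo) if s_hi > s_lo else 1.0
    return [d_lo + (v - s_lo) * scale for v in values]
```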
3) When feature mapping is complete, synthesis is performed using the mapped features.
As shown in FIG. 3, the synthesis is LSP synthesis: the mapped vocal tract coefficients are fed in as synthesis filter coefficients, and the excitation signal is generated at the amplitude obtained during analysis, as a pulse train at the designated pitch in voiced segments and random noise in unvoiced segments.
Ordinary analysis-synthesis methods use the residual signal extracted during analysis as the synthesis excitation, in order to reproduce the original sound perfectly.
The present invention selects pulses and noise as the excitation instead of the residual signal because a fixed generation process keeps sound production consistent, and because the speaker characteristics remaining in the residual signal are thereby excluded.
In addition, the frame boundaries of the synthesized sound are made to coincide with the frame boundaries used when extracting features from the input signal, which improves the accuracy of feature extraction.
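The pulse/noise excitation source of step 3 can be sketched as follows. The synthesis filter driven by the mapped coefficients is omitted, and the unit-pulse shape and uniform noise distribution are simplifying assumptions.

```python
import random

def make_excitation(n, voiced, pitch_period, amplitude, rng=None):
    """Generate one frame of excitation: a pulse train at the mapped
    pitch period for voiced frames, white noise for unvoiced frames,
    both scaled to the amplitude measured during analysis."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch deterministic
    if voiced:
        return [amplitude if i % pitch_period == 0 else 0.0 for i in range(n)]
    return [amplitude * rng.uniform(-1.0, 1.0) for i in range(n)]
```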
4) Re-extract recognition features from the synthesized speech.
As a result of the above steps, recognition features are obtained from which the components due to within-speaker and between-speaker variation have been removed. A speaker adaptation module such as is found in conventional speech recognition algorithms therefore becomes unnecessary.
Furthermore, as a result of inserting the synthesis step, the frame-period analysis intervals become exactly synchronized with the segment structure of the speech signal, improving the accuracy of feature extraction.
5) Perform speech recognition using a standard recognition algorithm.
Once the recognition features have been re-extracted, recognition is performed with an algorithm such as DTW (Dynamic Time Warping), HMM (Hidden Markov Modeling), or NN (Neural Network).
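Of the listed algorithms, DTW is the one used in the patent's experiment. A textbook dynamic-time-warping distance over scalar feature sequences is sketched below; in practice each element would be a vector of the re-extracted per-frame features.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Classic DTW distance between two feature sequences a and b,
    filled in by dynamic programming over the alignment grid."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(a[i - 1], b[j - 1])
            # Best of the three allowed predecessor cells.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because DTW warps the time axis, a sequence and a time-stretched copy of it score a distance of zero.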
In the present invention, a comparative experiment was conducted using the DTW method.
In an experiment on a 100-word vocabulary, the speaker-dependent DTW method attained a 97% recognition rate and the speaker-independent DTW method 91%, whereas the DTW method according to the present invention attained a 98% recognition rate in the speaker-independent case.
As described above, the speech recognition method according to the present invention does not extract recognition features directly from the input speech signal; instead, it maps the input onto the voice of a prepared standard speaker, extracts new recognition features from the resulting synthesized sound, and performs recognition on those. Removing the utterance-to-utterance variation of a speaker and the characteristic deviation between speakers in this way improves the recognition rate.
In addition, repeated training over many speakers becomes unnecessary, and the method adapts to a speaker in real time.
Furthermore, exploiting the one-dimensional nature of the linear prediction features both simplifies feature mapping and improves its accuracy.
FIG. 1 illustrates a conventional speech recognition process.
FIG. 2 illustrates the speech recognition process according to the present invention.
FIG. 3 is a block diagram showing one embodiment of an apparatus for performing the speech synthesis process shown in FIG. 2.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Publications (2)
Publication Number | Publication Date |
---|---|
KR970002856A KR970002856A (en) | 1997-01-28 |
KR100322693B1 true KR100322693B1 (en) | 2002-05-13 |
Family
ID=37460732
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1019950018111A KR100322693B1 (en) | 1995-06-29 | 1995-06-29 | Voice recognition method using linear prediction analysis synthesis |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR100322693B1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100667522B1 (en) * | 1998-12-18 | 2007-05-17 | 주식회사 현대오토넷 | Speech Recognition Method of Mobile Communication Terminal Using LPC Coefficient |
KR100762588B1 (en) * | 2001-06-26 | 2007-10-01 | 엘지전자 주식회사 | voice recognition method for joing the speaker adaptation and the rejection of error input |
KR100427243B1 (en) * | 2002-06-10 | 2004-04-14 | 휴먼씽크(주) | Method and apparatus for analysing a pitch, method and system for discriminating a corporal punishment, and computer readable medium storing a program thereof |
CN112102833B (en) * | 2020-09-22 | 2023-12-12 | 阿波罗智联(北京)科技有限公司 | Speech recognition method, device, equipment and storage medium |
- 1995-06-29: application KR1019950018111A filed in KR; granted as KR100322693B1 (not active, IP right cessation)
Also Published As
Publication number | Publication date |
---|---|
KR970002856A (en) | 1997-01-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Vergin et al. | Generalized mel frequency cepstral coefficients for large-vocabulary speaker-independent continuous-speech recognition | |
US6535852B2 (en) | Training of text-to-speech systems | |
Childers et al. | Voice conversion | |
US6829581B2 (en) | Method for prosody generation by unit selection from an imitation speech database | |
CN104272382A (en) | Method and system for template-based personalized singing synthesis | |
CN116665669A (en) | Voice interaction method and system based on artificial intelligence | |
KR20180078252A (en) | Method of forming excitation signal of parametric speech synthesis system based on gesture pulse model | |
Yadav et al. | Prosodic mapping using neural networks for emotion conversion in Hindi language | |
KR100322693B1 (en) | Voice recognition method using linear prediction analysis synthesis | |
CN116364096B (en) | Electroencephalogram signal voice decoding method based on generation countermeasure network | |
JP3281266B2 (en) | Speech synthesis method and apparatus | |
Saitou et al. | Analysis of acoustic features affecting" singing-ness" and its application to singing-voice synthesis from speaking-voice | |
Asakawa et al. | Automatic recognition of connected vowels only using speaker-invariant representation of speech dynamics | |
Acero | Source-filter models for time-scale pitch-scale modification of speech | |
Aso et al. | Speakbysinging: Converting singing voices to speaking voices while retaining voice timbre | |
Blomberg | Adaptation to a speaker's voice in a speech recognition system based on synthetic phoneme references | |
JP3727885B2 (en) | Speech segment generation method, apparatus and program, and speech synthesis method and apparatus | |
Yathigiri et al. | Voice transformation using pitch and spectral mapping | |
US7130799B1 (en) | Speech synthesis method | |
JP7079455B1 (en) | Acoustic model learning devices, methods and programs, as well as speech synthesizers, methods and programs | |
Wang et al. | Tone recognition of continuous Mandarin speech based on hidden Markov model | |
JPH0293500A (en) | Pronunciation evaluating method | |
JP3921416B2 (en) | Speech synthesizer and speech clarification method | |
Banerjee et al. | Voice intonation transformation using segmental linear mapping of pitch contours | |
Orphanidou et al. | Voice morphing using the generative topographic mapping |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
A201 | Request for examination | ||
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment | Payment date: 2010-12-30; year of fee payment: 10 |
LAPS | Lapse due to unpaid annual fee |