KR19980028644A - Speech recognition device and method

Info

Publication number: KR19980028644A
Application number: KR1019960047781A
Authority: KR
Inventors: 공병구; 김상룡
Original assignee: 김광호; 삼성전자 주식회사
Priority date: 1996-10-23
Filing date: 1996-10-23
Publication date: 1998-07-15
Also published as: KR100393196B1

Abstract

An apparatus and method for performing speech recognition using phase information are disclosed. In the speech recognition apparatus according to the present invention, a sampled analog speech signal is converted into a digital signal by an analog-to-digital converter and then applied to a spectrum analyzer and a phase analyzer. The spectrum analyzer and the phase analyzer perform spectral analysis and phase analysis on the digital speech signal to generate a spectral pattern and a phase pattern, respectively. Spectral patterns for the speech signals that are to be recognized as distinct are stored in a spectral template/model, and phase patterns are stored in a phase template/model, classified into groups of patterns that share the same spectrum. A first pattern matching unit outputs, as first speech recognition data, the speech data corresponding to the stored spectral pattern that matches the output of the spectrum analyzer, and applies it to a second pattern matching unit. The second pattern matching unit selects one of the groups of the phase template/model according to the first speech recognition data and outputs, as second speech recognition data, the speech data stored in correspondence with the phase pattern in the selected group that matches the output of the phase analyzer. The second speech recognition data is the final recognition result. Such a speech recognition apparatus retains discriminating power even in noisy environments and can recognize natural language smoothly.

Description

Speech recognition device and method

The present invention relates to a speech recognition apparatus and method, and more particularly to a speech recognition apparatus and method with a high ability to discriminate between speech signals.

Referring to FIG. 1, a speech recognition apparatus according to the prior art includes an analog-to-digital converter 110, a spectrum analyzer 120, a pattern matching unit 140, and a spectral template/model 130. The analog-to-digital converter 110 converts a sampled analog speech signal into a digital signal and outputs it. The digital speech signal is analyzed by the spectrum analyzer 120, which generates a spectral pattern based on the spectral information contained in the signal. The spectral template/model 130 stores spectral patterns for each of the speech signals that must be distinguished from one another. The pattern matching unit 140 outputs, as the recognition result, the speech data corresponding to the stored spectral pattern that matches the spectral pattern output from the spectrum analyzer 120.

Because such an apparatus performs speech recognition using only the result of spectral analysis, it works well when the speech signals to be distinguished have clearly different frequency shapes. When speech signals that must be recognized as distinct have similar frequency shapes, however, it cannot tell them apart, and the recognition rate cannot be improved. That is, as shown in FIGS. 2A and 2B, when the spectral shapes differ greatly, the two speech signals corresponding to FIGS. 2A and 2B can be distinguished from spectral magnitude information alone. On the other hand, as shown in FIGS. 2C and 2D, speech signals that must be recognized as distinct but have similar frequency shapes can be distinguished only to a limited extent from spectral information alone. For example, the Korean vowels 오 (o) and 우 (u) have similar frequency shapes, so it is difficult to distinguish them using spectral information alone.

Moreover, when recognition is performed on speech from a variety of speech sources rather than a single source (that is, on the voices of several people rather than one person), the spectral magnitude information of the speech signals becomes less distinctive, and distinguishing the signals from spectral information alone becomes even harder. Likewise, when the speech to be recognized is not an isolated word but natural speech that is not fully articulated, the spectral magnitude information becomes ambiguous, so that speech recognition using spectral magnitude information alone becomes impossible.

Accordingly, it is an object of the present invention to provide a speech recognition apparatus and method whose recognition rate does not drop even for speech signals generated by a variety of speech sources.

It is another object of the present invention to provide a speech recognition apparatus and method with a high recognition rate even for natural speech rather than isolated words.

FIG. 1 is a block diagram of a speech recognition apparatus according to the prior art.

FIGS. 2A to 2D are diagrams for explaining characteristics of speech recognition.

FIG. 3 is a block diagram of a speech recognition apparatus according to the present invention.

FIG. 4 is a detailed block diagram of the spectrum analyzer 120 and the phase analyzer 310 of FIG. 3.

FIGS. 5A and 5B are graphs showing that speech signals that must be recognized as distinct can have similar frequency-magnitude shapes, and FIG. 5C is a graph showing that their normalized phase shapes differ even when the frequency-magnitude shapes are the same.

Explanation of reference numerals for the main parts of the drawings

110 ... analog-to-digital converter

120 ... spectrum analyzer

130 ... spectral template/model

140 ... pattern matching unit

310 ... phase analyzer

320 ... phase template/model

330 ... pattern matching unit

To achieve the above objects, a speech recognition apparatus according to the present invention includes an analog-to-digital converter, a spectrum analyzer, a phase analyzer, a spectral template/model, a phase template/model, a first pattern matching unit, and a second pattern matching unit. A sampled analog speech signal is converted into a digital signal by the analog-to-digital converter and then applied to the spectrum analyzer and the phase analyzer. The spectrum analyzer and the phase analyzer perform spectral analysis and phase analysis on the digital speech signal to generate a spectral pattern and a phase pattern, respectively. Spectral patterns for the speech signals that are to be recognized as distinct are stored in the spectral template/model, and phase patterns are stored in the phase template/model, classified into groups of patterns that share the same spectrum. The first pattern matching unit outputs, as first speech recognition data, the speech data corresponding to the stored spectral pattern that matches the output of the spectrum analyzer, and applies it to the second pattern matching unit. The second pattern matching unit selects one of the groups of the phase template/model according to the first speech recognition data and outputs, as second speech recognition data, the speech data stored in correspondence with the phase pattern in the selected group that matches the output of the phase analyzer. The second speech recognition data is the final recognition result.

The spectrum analyzer includes a fast Fourier transformer (FFT) that applies a fast Fourier transform to the output of the analog-to-digital converter, and a spectrum extractor that extracts the spectrum from the output of the fast Fourier transformer and outputs a spectral pattern.

The phase analyzer includes a fast Fourier transformer (FFT) that applies a fast Fourier transform to the output of the analog-to-digital converter, and a phase extractor that extracts phase information from the output of the fast Fourier transformer and outputs a phase pattern.
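The two analyzers described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: a naive discrete Fourier transform stands in for the FFT so that the example needs nothing beyond the Python standard library, and the names `dft` and `spectrum_and_phase` are hypothetical, not taken from the patent.

```python
import cmath
import math


def dft(frame):
    """Naive O(N^2) discrete Fourier transform, used here in place of an FFT."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def spectrum_and_phase(frame):
    """Return (magnitudes, phases) for the non-negative-frequency bins.

    The magnitudes play the role of the spectral pattern and the phases
    (in radians) the role of the raw phase information.
    """
    bins = dft(frame)[: len(frame) // 2 + 1]
    magnitudes = [abs(c) for c in bins]
    phases = [cmath.phase(c) for c in bins]
    return magnitudes, phases


# Test frame: a cosine at bin 2 with a phase offset of pi/4.
n = 16
frame = [math.cos(2 * math.pi * 2 * t / n + math.pi / 4) for t in range(n)]
mags, phases = spectrum_and_phase(frame)
peak_bin = max(range(len(mags)), key=mags.__getitem__)
```

For this test frame the magnitude peaks at bin 2 and the phase at that bin recovers the offset of the cosine, which is the kind of information a magnitude-only analysis discards.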

The first and second pattern matching units perform either a hidden Markov model (HMM) algorithm or a dynamic time warping (DTW) algorithm.

To achieve the above objects, a speech recognition method according to the present invention includes: a first step of converting a sampled analog speech signal into a digital signal; a second step of performing spectral analysis on the digitized speech signal to generate a spectral pattern; a third step of performing phase analysis on the digitized speech signal to generate a phase pattern; a fourth step of generating, as first speech recognition data, the speech data corresponding to the spectral pattern that matches the spectral pattern obtained in the second step, from among the spectral patterns for the speech signals to be recognized as distinct; and a fifth step of selecting, according to the first speech recognition data, one group from among the phase patterns for the speech signals to be recognized as distinct, and generating, as second speech recognition data, the speech data corresponding to the phase pattern in the selected group that matches the phase pattern generated in the third step.

The second step applies a fast Fourier transform to the digital speech signal and extracts the spectrum from the result to generate the spectral pattern. The third step applies a fast Fourier transform to the digital speech signal and extracts phase information from the result to generate the phase pattern.

The fourth and fifth steps are performed by a hidden Markov model (HMM) algorithm or by a dynamic time warping (DTW) algorithm.

Preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 3 is a block diagram of a speech recognition apparatus according to the present invention, which includes an analog-to-digital converter 110, a spectrum analyzer 120, a phase analyzer 310, a spectral template/model 130, a phase template/model 320, a pattern matching unit 140, and a pattern matching unit 330.

The analog-to-digital converter 110 converts the sampled speech signal into a digital signal and outputs it. The digital speech signal DV output from the analog-to-digital converter 110 is applied to the spectrum analyzer 120 and the phase analyzer 310. The spectrum analyzer 120 performs spectral analysis on the digital speech signal DV, extracts the spectral information it contains, and outputs a spectral pattern SI accordingly.

The phase analyzer 310 performs phase analysis on the digital speech signal DV, extracts the phase information it contains, and outputs a phase pattern PI accordingly. The phase information consists of a voiceprint that reflects individual differences, that is, characteristics that arise because the size, length, and so on of the articulatory organs differ from person to person, and a voiceprint that reflects the intrinsic phase of each characteristic sound. Accordingly, when the apparatus is intended to recognize the speaker, it generates a phase pattern based on the individual-difference voiceprint; when the apparatus is intended simply to distinguish sounds, it generates a phase pattern based on the intrinsic phase voiceprint of the sound.

The spectral template/model 130 stores spectral patterns for each of the speech signals that must be recognized as distinct. The pattern matching unit 140 finds, among the spectral patterns stored in the spectral template/model 130, the one that matches the spectral pattern SI output from the spectrum analyzer 120, and outputs the corresponding speech data as first speech recognition data. The phase template/model 320 stores phase patterns for the speech signals that must be recognized as distinct, classified into groups of patterns that share the same spectrum. The pattern matching unit 330 selects one of the groups of the phase template/model 320 based on the first speech recognition data output from the pattern matching unit 140. It then finds, among the phase patterns belonging to the selected group, the one that matches the phase pattern output from the phase analyzer 310, and outputs the corresponding speech data as second speech recognition data. The second speech recognition data is the recognition result of the speech recognition apparatus according to the present invention.
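The two-stage lookup described above can be sketched as follows. This is a simplified illustration, not the matching method of the patent: a nearest-neighbour search with Euclidean distance is swapped in for the HMM or DTW matching, and the template values, group names, and the function names `nearest` and `recognize` are all hypothetical.

```python
import math


def nearest(pattern, templates):
    """Return the key whose stored template is closest in Euclidean distance."""
    return min(templates, key=lambda key: math.dist(pattern, templates[key]))


# Hypothetical spectral template store: one representative pattern per group.
spectral_templates = {
    "group_a": [1.0, 0.2, 0.1],
    "group_o_u": [0.3, 0.9, 0.4],
}

# Hypothetical phase template store: phase patterns organised by spectral group.
phase_templates = {
    "group_a": {"a": [0.0, 0.5, 1.0]},
    "group_o_u": {"o": [0.0, 0.8, -0.6], "u": [0.0, -0.7, 0.9]},
}


def recognize(spectral_pattern, phase_pattern):
    """First stage picks the spectral group, second stage picks the sound in it."""
    group = nearest(spectral_pattern, spectral_templates)   # first recognition data
    label = nearest(phase_pattern, phase_templates[group])  # second recognition data
    return group, label


result = recognize([0.32, 0.88, 0.41], [0.0, -0.6, 0.8])
```

Only the phase templates of the selected group are searched in the second stage, which is the point of grouping them by spectrum. A practical distance on phase values would also have to account for wrap-around at ±π, which this sketch ignores.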

The spectral patterns and phase patterns stored in the spectral template/model 130 and the phase template/model 320 are obtained from speech signals of a fixed length that must be recognized as distinct. In addition, when the goal is to raise the recognition rate for sounds rather than to recognize speakers, speech signals acquired from a variety of speech sources should be analyzed to produce spectral patterns and phase patterns, which are then averaged and stored, so that the variation in phase information due to individual differences between speakers is taken into account.
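A minimal sketch of the averaging step follows. The text says only that the patterns are averaged; using an arithmetic mean for magnitudes and a circular mean for phases is an assumption made here because phase values wrap around at ±π, and the function names are hypothetical.

```python
import cmath
import math


def mean_spectrum(patterns):
    """Bin-by-bin arithmetic mean of several magnitude patterns."""
    return [sum(column) / len(column) for column in zip(*patterns)]


def circular_mean_phase(patterns):
    """Bin-by-bin circular mean of several phase patterns (radians).

    Each phase is mapped to a unit vector, the vectors are summed, and the
    angle of the sum is taken, so that values near +pi and -pi average to a
    value near pi rather than near 0.
    """
    return [
        cmath.phase(sum(cmath.exp(1j * p) for p in column))
        for column in zip(*patterns)
    ]


avg_spectrum = mean_spectrum([[1.0, 2.0], [3.0, 4.0]])
avg_phase = circular_mean_phase([[3.0], [-3.1]])
expected_phase = (3.0 + (2 * math.pi - 3.1)) / 2
```

In the example, the two phases 3.0 and -3.1 radians lie close together on the circle, and the circular mean lands between them near π, whereas a plain arithmetic mean would give -0.05.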

Here, when the phase pattern PI output from the phase analyzer 310 contains only information on the voiceprint that reflects individual differences, for the purpose of speaker recognition, the pattern matching unit 140 performs speech recognition using the spectral pattern and the pattern matching unit 330 performs speaker recognition according to the phase pattern.

On the other hand, when the phase pattern PI output from the phase analyzer 310 is intended to raise the speech recognition rate and contains only information on the intrinsic phase voiceprint of the sound, the pattern matching unit 140 performs coarse speech recognition and the pattern matching unit 330 performs finer speech recognition. Take the Korean vowels 오 (o) and 우 (u) as an example: they have nearly the same frequency-magnitude shape and can therefore be classified into the same group. In this case, the pattern matching unit 140 recognizes the group to which they belong, and the pattern matching unit 330 determines whether the sound is 오 or 우 based on the phase pattern PI generated by the phase analyzer 310. The spectral template/model 130 stores the spectral pattern common to 오 and 우 in correspondence with the group to which they belong. The phase template/model 320 stores the phase pattern for 오 in correspondence with the speech data "오" and the phase pattern for 우 in correspondence with the speech data "우". The pattern matching unit 330 thus outputs 오 when the phase pattern output from the phase analyzer 310 matches the phase pattern for 오, and outputs 우 when it matches the phase pattern for 우.

As shown in FIGS. 5A and 5B, speech signals with similar spectral information are classified into the same group, and a representative spectral pattern is stored in the spectral template/model 130 together with the group information. FIG. 5C is a graph showing that speech signals belonging to the same group nevertheless differ in their phase information. In other words, speech signals that cannot be adequately distinguished by spectral pattern alone can be distinguished when the phase pattern is used.

Alternatively, the phase analyzer 310 may generate the phase pattern from two or more kinds of phase information, such as the individual-difference voiceprint for speaker recognition and the intrinsic phase voiceprint of the sound for fine speech recognition. In this case the phase pattern takes the form of digital data in which certain bits of the bit format carry the individual-difference voiceprint information and the remaining bits carry the intrinsic phase voiceprint information of the sound. The phase template/model 320 is organized to match. For example, the phase template/model 320 may be divided into a part in which the phase patterns for speech signals having the same spectrum are classified and stored in the same group, and a part in which the phase patterns for the individual-difference voiceprints are stored.

More specifically, when K bits of the N-bit phase pattern generated by the phase analyzer 310 carry the intrinsic voiceprint of the sound and the remaining N-K bits carry the individual-difference voiceprint, the phase template/model 320 may consist of a part that stores the K-bit phase patterns of the speech signals, classified into groups of patterns having the same spectral pattern, and a part that stores the N-K-bit phase patterns of the individual-difference voiceprints. In this case, the pattern matching unit 330 runs its algorithm on the K-bit phase pattern and on the N-K-bit phase pattern in parallel. The recognition result therefore includes information on the speaker as well as on the recognized sound.
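The bit-field split described above can be sketched as follows. The text does not say which bits hold which field, so placing the K sound bits in the most significant positions is an assumption, and `split_phase_pattern` is a hypothetical name.

```python
def split_phase_pattern(pattern, n_bits, k_bits):
    """Split an N-bit phase pattern into (K-bit sound field, N-K-bit speaker field).

    Assumes the K sound bits occupy the most significant positions; the
    patent does not specify the bit layout.
    """
    if not 0 <= k_bits <= n_bits:
        raise ValueError("k_bits must be between 0 and n_bits")
    if not 0 <= pattern < (1 << n_bits):
        raise ValueError("pattern does not fit in n_bits")
    speaker_bits = n_bits - k_bits
    sound_field = pattern >> speaker_bits
    speaker_field = pattern & ((1 << speaker_bits) - 1)
    return sound_field, speaker_field


# An 8-bit pattern with K = 5: the top five bits are 10110, the low three 011.
sound, speaker = split_phase_pattern(0b10110011, n_bits=8, k_bits=5)
```

Each field can then be matched against its own part of the phase template/model, which is what allows the sound and the speaker to be identified in parallel.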

The spectral patterns and phase patterns stored in the spectral template/model 130 and the phase template/model 320 need to be created by analyzing speech signals obtained by having a substantially large number of people pronounce the sounds.

The phase pattern can be obtained by normalizing the phase information so that the phase value at a specific frequency becomes 0 and applying the normalization coefficient to the other frequencies.
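One possible reading of this normalization is sketched below. The patent does not define the normalization coefficient; the sketch assumes it scales linearly with frequency (bin index divided by reference bin index), which amounts to removing a pure time shift. The names `wrap` and `normalize_phase` are hypothetical.

```python
import math


def wrap(angle):
    """Wrap an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def normalize_phase(phases, ref_bin=1):
    """Force the phase at ref_bin to 0 and correct the other bins to match.

    Assumption: bin k is corrected by k / ref_bin times the reference phase,
    i.e. a linear-phase (time-shift) term is removed. With ref_bin=1 the
    result is unchanged by a circular shift of the analysed frame.
    """
    scale = phases[ref_bin] / ref_bin
    return [wrap(p - k * scale) for k, p in enumerate(phases)]


phases = [0.0, 0.5, 1.2, -0.4]
normalized = normalize_phase(phases)

# The same frame circularly delayed by 3 samples out of 8 adds a linear phase term.
shifted = [wrap(p - 2 * math.pi * k * 3 / 8) for k, p in enumerate(phases)]
normalized_shifted = normalize_phase(shifted)
```

Under this assumption the original and the delayed frame produce the same normalized phase pattern, so the pattern depends on the sound rather than on where the analysis window happens to start.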

The pattern matching unit 140 and the pattern matching unit 330 extract the matching spectral pattern and phase pattern, respectively, and each performs either a hidden Markov model (HMM) algorithm or a dynamic time warping (DTW) algorithm.
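As an illustration of one of the two matching options, the following is a textbook dynamic time warping distance. It is a generic sketch, not the patent's specific implementation: each frame is reduced to a single number for brevity, whereas real spectral or phase patterns would be vectors compared with a suitable frame distance, and `dtw_distance` is a hypothetical name.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two sequences of scalar frames."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] is the best accumulated cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # seq_b frame is held while seq_a advances
                cost[i][j - 1],      # seq_a frame is held while seq_b advances
                cost[i - 1][j - 1],  # both sequences advance together
            )
    return cost[rows][cols]


# The same contour spoken more slowly still matches with zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
different = dtw_distance([1, 2, 3], [2, 3, 4])
```

A matcher built on this would compute the distance from the input pattern to every stored template and select the template with the smallest distance.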

FIG. 4 is a detailed block diagram of the phase analyzer 310 and the spectrum analyzer 120 shown in FIG. 3. The digital speech signal output from the analog-to-digital converter 110 is transformed by a fast Fourier transformer (FFT) 410, and the result is applied to a spectral information extractor 420 and a phase information extractor 430.

In the speech recognition method according to the present invention, a sampled analog speech signal is first converted into a digital signal. Spectral analysis and phase analysis are then performed on the digitized speech signal to generate a spectral pattern and a phase pattern, and recognition is performed using these patterns. The first recognition uses the spectral pattern: spectral patterns for the speech signals to be distinguished are stored in advance, and the speech data corresponding to the stored spectral pattern that matches is generated as first speech recognition data. The second recognition then uses the first speech recognition data and the phase pattern generated by the phase analysis. In the second recognition, a group is determined according to the first speech recognition data, and the speech data corresponding to the matching phase pattern in that group is output as second speech recognition data. The second speech recognition data is the final recognition result.

In the step of generating the spectral pattern and the phase pattern, a fast Fourier transform is applied to the digital speech signal, and the spectral and phase information are extracted from the result to generate the spectral pattern and the phase pattern. The first and second recognitions are each performed using a hidden Markov model (HMM) algorithm or a dynamic time warping (DTW) algorithm.

The present invention is not limited to the above embodiments, and many variations are possible within the spirit of the invention by those of ordinary skill in the art.

As described above, the present invention can clearly recognize natural speech acquired in a real environment in which various kinds of noise occur. Conventional speech recognition apparatuses can recognize only clearly separated pronunciations in a relatively quiet environment such as a laboratory. Because they cannot cope with the noise and natural utterances of a real recognition environment, their recognition rate drops sharply and they have not reached practical use. In contrast, the speech recognition apparatus and method according to the present invention retain discriminating power even in noisy environments. They can also smoothly recognize natural language, which has varied prosody, in which the characteristics that make up a full phonetic value are not sustained, and in which the transitions between sounds are more pronounced. In addition, where necessary, the phase information allows speaker recognition to be performed in parallel with speech recognition, which improves the performance of a system that adopts it.

Claims

An analog-to-digital converter for converting a sampled analog speech signal into a digital signal;

A spectrum analyzer configured to perform spectral analysis on the output of the analog-to-digital converter and output a spectral pattern;

A phase analyzer configured to perform phase analysis on the output of the analog-to-digital converter and output a phase pattern;

A spectral template/model that stores spectral patterns for speech signals to be recognized as distinct;

A phase template/model that stores phase patterns for speech signals to be recognized as distinct, the phase patterns being classified and stored in groups of patterns having the same spectrum;

A first pattern matching unit configured to compare the output of the spectrum analyzer with the spectral patterns stored in the spectral template/model and output the speech data stored in correspondence with the matching spectral pattern as first speech recognition data; and

A second pattern matching unit configured to select one of the groups of the phase template/model according to the first speech recognition data, compare the phase patterns belonging to the selected group with the output of the phase analyzer, and output the speech data stored in correspondence with the matching phase pattern as second speech recognition data.

The apparatus of claim 1, wherein the spectrum analyzer comprises:

A fast Fourier transformer (FFT) for applying a fast Fourier transform to the output of the analog-to-digital converter; and

A spectrum extractor for extracting a spectrum from the output of the fast Fourier transformer and outputting a spectral pattern.

The apparatus of claim 1, wherein the phase analyzer comprises:

A phase extractor for extracting phase information from the output of the fast Fourier transformer and outputting a phase pattern.

The apparatus of claim 1, wherein the first pattern matching unit performs a hidden Markov model algorithm.

The apparatus of claim 1, wherein the first pattern matching unit performs a dynamic time warping algorithm.

The apparatus of claim 1, wherein the second pattern matching unit performs a hidden Markov model algorithm.

The apparatus of claim 1, wherein the second pattern matching unit performs a dynamic time warping algorithm.

A first step of converting a sampled analog speech signal into a digital signal;

A second step of performing spectral analysis on the digitized speech signal to generate a spectral pattern;

A third step of performing phase analysis on the digitized speech signal to generate a phase pattern;

A fourth step of generating, as first speech recognition data, the speech data corresponding to the spectral pattern that matches the spectral pattern obtained in the second step, from among the spectral patterns for speech signals to be recognized as distinct; and

A fifth step of selecting one group from among the phase patterns for speech signals to be recognized as distinct, and generating, as second speech recognition data, the speech data corresponding to the phase pattern in the selected group that matches the phase pattern generated in the third step.

The method of claim 8, wherein the second step comprises:

Applying a fast Fourier transform to the digital speech signal; and

Extracting a spectrum from the fast Fourier transformed digital speech signal to generate a spectral pattern.

The method of claim 8, wherein the third step comprises:

Applying a fast Fourier transform to the digital speech signal; and

Extracting phase information from the fast Fourier transformed digital speech signal to generate a phase pattern.

The speech recognition method of claim 8, wherein the fourth step is performed by a hidden Markov model algorithm.

The speech recognition method of claim 8, wherein the fourth step is performed by a dynamic time warping algorithm.

The speech recognition method of claim 8, wherein the fifth step is performed by a hidden Markov model algorithm.

The speech recognition method of claim 8, wherein the fifth step is performed by a dynamic time warping algorithm.