KR950001067B1

KR950001067B1 - Speech recognition device

Info

Publication number: KR950001067B1
Application number: KR1019940026432A
Authority: KR
Inventors: 죠지 카네; 아끼라 노하라
Original assignee: 마쯔시다덴기산교 가부시기가이샤; 다니이 아끼오
Priority date: 1990-05-28
Filing date: 1994-10-15
Publication date: 1995-02-08

Abstract

No content.

Description

Speech recognition device

FIG. 1 is a block diagram showing a speech recognition device according to a first embodiment of the present invention.

FIG. 2 is a block diagram showing a speech recognition device according to a second embodiment of the present invention.

FIG. 3 is a graph for explaining general cepstrum analysis.

FIG. 4 is a graph for explaining noise prediction in the present invention.

FIGS. 5 and 6 are graphs for explaining noise removal in the present invention.

* Explanation of reference numerals for main parts of the drawings

401: band dividing means
402: cepstrum analysis means
403: peak frequency detection means
404: first storage means
405: first comparison means
406: voice cutting-out means
407: feature extraction means
408: second comparison means
409: second storage means
410: comparison output synthesis means
411: peak detection means
412: speech discriminating means
413: noise prediction means
414: noise removing means
415: band synthesizing means

The present invention relates to a speech recognition device that recognizes words spoken by a person.

In recent years, systems have been developed for places that require secrecy, such as research laboratories, that cannot be used unless particular words are spoken. Such a system uses speech recognition technology: for example, the words {"one", "three", "five", "seven"} serve as a key, and the system starts operating when these words are spoken to it.

In a system that requires such secrecy, however, it is desirable not only to use words as the key but also to restrict the person who speaks them to a specific individual. That is, secrecy is easier to maintain if the system can be used only when a predetermined individual speaks the key words.

In view of this problem with conventional speech recognition devices, an object of the present invention is to provide a speech recognition device that recognizes and outputs only speech uttered by a predetermined person.

To achieve the above object, the present invention provides a speech recognition device comprising: first storage means for storing in advance a peak frequency obtained by band-dividing the speech signal of a known person and performing cepstrum analysis on it; band dividing means for band-dividing an unknown, noise-contaminated speech input signal; cepstrum analysis means for performing cepstrum analysis on the output of the band dividing means; peak frequency detection means for detecting a peak frequency from the cepstrum analysis output of the cepstrum analysis means; first comparison means for comparing the peak detection information from the peak frequency detection means with each peak frequency stored in the first storage means and determining whose speech signal the unknown speech signal is; second storage means for storing in advance the speech features of various utterances; voice cutting-out means for cutting out the speech portion from the noise-contaminated speech signal; feature extraction means for extracting features of the cut-out speech; second comparison means for recognizing the speech by comparing the speech features extracted by the feature extraction means with the speech features stored in the second storage means; and comparison output synthesis means for performing predetermined processing based on the comparison result of the first comparison means and the comparison result of the second comparison means and outputting the result.

Because the speech recognition device configured as above uses cepstrum analysis to identify, from the speech signal, the person who uttered the speech, and also uses that speaker information, it is effective for systems that require secrecy. Furthermore, when noise is removed before recognition, recognition is performed on the noise-removed signal, so the recognition rate is further improved.

Embodiments of the present invention are described below with reference to the drawings.

[Embodiment 1]

FIG. 1 is a block diagram of a speech recognition device according to the first embodiment of the present invention. In FIG. 1, the band dividing means 401 performs band division on a signal and comprises, for example, an A/D converter and FFT means for Fourier-transforming the signal.

The cepstrum analysis means 402 is a means for performing cepstrum analysis.

FIG. 3(a) shows the spectrum obtained by frequency analysis of a signal, and FIG. 3(b) is a graph showing the cepstrum.

The peak frequency detection means 403 detects the peak frequency of the cepstrum distribution obtained from the cepstrum analysis means 402. That is, the pitch (peak) of the cepstrum shown in FIG. 3(b) indicates the peak frequency.

The first storage means 404 is a storage means such as a ROM. It stores in advance the peak frequency obtained by band-dividing the speech signal of a known person, performing cepstrum analysis on the band-divided signal, and taking the peak frequency from the cepstrum analysis result.

The first comparison means 405 compares the peak detection information from the peak frequency detection means 403 with each peak frequency stored in the first storage means 404 and determines whose speech signal the unknown speech signal is. The owner of the closest stored peak frequency within a predetermined tolerance is judged to be the speaker.
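The decision rule of the first comparison means (closest stored peak frequency within a predetermined tolerance) might be sketched as follows. The function name `identify_speaker`, the dictionary layout of the enrolled peaks, and the 10 Hz default tolerance are illustrative assumptions, not values given in the patent.

```python
def identify_speaker(peak_hz, enrolled, tolerance_hz=10.0):
    """Return the enrolled speaker whose stored peak frequency is closest
    to peak_hz, or None if no stored peak lies within the tolerance."""
    best_name, best_dist = None, None
    for name, stored_hz in enrolled.items():
        dist = abs(peak_hz - stored_hz)
        if dist <= tolerance_hz and (best_dist is None or dist < best_dist):
            best_name, best_dist = name, dist
    return best_name
```

Returning `None` for an out-of-tolerance peak lets a later stage treat the speaker as unknown rather than forcing a match.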

Meanwhile, the second storage means 409 is a means such as a ROM that stores in advance the speech features of various utterances. For example, it stores speech features for each word, for each syllable such as "a", "i", and "u", or for each phoneme.

The voice cutting-out means 406 cuts out the speech portion from the noise-contaminated speech signal, using, for example, a filter.

The feature extraction means 407 extracts the features of the cut-out speech.

The second comparison means 408 compares the speech features extracted by the feature extraction means 407 with the speech features stored in the second storage means 409 and recognizes the speech, using, for example, a technique such as pattern matching.
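The patent names pattern matching only as an example and does not say which features or distance measure are used. As one possible reading, the sketch below picks the stored template nearest to the extracted feature vector by Euclidean distance; the function name `recognize` and the fixed-length feature vectors are assumptions made for illustration.

```python
import math


def recognize(feature, templates):
    """Return the label of the stored template nearest to the feature vector.

    templates maps a label (word, syllable, or phoneme) to a feature vector
    of the same length as feature.
    """
    return min(templates, key=lambda label: math.dist(feature, templates[label]))
```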

The comparison output synthesis means 410 performs predetermined processing based on the comparison result of the first comparison means 405 and the comparison result of the second comparison means 408 and outputs the result. For example, it does not output the recognition result when the speaker is anyone other than a previously designated person. Alternatively, it may output the recognition result together with information on whose voice it is.

Next, the operation of the above embodiment is described.

First, the speech signal of a known person is band-divided and cepstrum-analyzed, and the peak frequency obtained from the cepstrum analysis output is stored in advance in the first storage means 404. The speech features of various utterances are likewise stored in advance in the second storage means 409.

The band dividing means 401 band-divides the unknown, noise-contaminated speech input signal, and the cepstrum analysis means 402 performs cepstrum analysis on the output of the band dividing means 401. The peak frequency detection means 403 detects the peak frequency from the cepstrum analysis output of the cepstrum analysis means 402. The first comparison means 405 then compares the peak detection information from the peak frequency detection means 403 with each peak frequency stored in the first storage means 404 and determines whose speech signal the unknown speech signal is.

The voice cutting-out means 406 cuts out the speech portion from the noise-contaminated speech signal, and the feature extraction means 407 extracts the features of the cut-out speech. The second comparison means 408 compares the speech features extracted by the feature extraction means 407 with the speech features stored in the second storage means 409 and recognizes the speech.

Based on the comparison result of the first comparison means 405 and the comparison result of the second comparison means 408, the comparison output synthesis means 410 outputs the speech recognition result only when the speech signal is that of a predetermined person.
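The gating behavior of the comparison output synthesis means can be sketched as below. The function name `combine_outputs`, the set of authorized speakers, and the dictionary return value are hypothetical; the patent only says that the recognition result is suppressed for anyone other than a designated person and may carry the speaker's identity.

```python
def combine_outputs(speaker, recognized_word, authorized):
    """Return the recognition result tagged with the speaker, or None when
    the speaker is unknown or is not one of the designated persons."""
    if speaker is None or speaker not in authorized:
        return None  # suppress output for non-designated speakers
    return {"speaker": speaker, "word": recognized_word}
```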

[Embodiment 2]

FIG. 2 shows a second embodiment of the present invention. The band dividing means 401, cepstrum analysis means 402, first storage means 404, first comparison means 405, voice cutting-out means 406, feature extraction means 407, second comparison means 408, second storage means 409, and comparison output synthesis means 410 are the same as those shown in FIG. 1, so their description is omitted.

The peak detection means 411 detects peaks using the cepstrum analysis result from the cepstrum analysis means 402, and the peak frequency detection means 403 detects the peak frequency based on that peak information. The speech discriminating means 412 discriminates the speech portion based on the peak information. The noise prediction means 413 uses the speech portion information from the speech discriminating means 412 to predict the noise in the speech portion of the band-divided signal. It predicts the noise component for each channel from the speech/noise input divided into m channels. For example, as shown in FIG. 4, with frequency on the X axis, speech level on the Y axis, and time on the Z axis, the data P1, P2, ..., Pi at frequency f1 are taken and the next data Pj is predicted. For example, the average of the noise portions P1 to Pi is taken as Pj. Alternatively, when the speech signal portion continues further, Pj is multiplied by an attenuation coefficient. The noise removing means 414 removes the predicted noise from the band-divided signal.
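The per-channel prediction described above (average the past noise data P1 to Pi to obtain Pj, and attenuate Pj while speech continues) might look like the following sketch. The function name `predict_noise`, the 0.9 default coefficient, and the choice to apply the attenuation once per elapsed speech frame are assumptions; the patent states only that Pj is multiplied by an attenuation coefficient.

```python
def predict_noise(noise_history, speech_frames_elapsed=0, attenuation=0.9):
    """Predict the next noise level Pj for each of the m channels.

    noise_history is a list of past noise-only frames (P1..Pi), each a list
    of m channel levels. The prediction is the per-channel average, scaled
    by attenuation once for every speech frame elapsed since the last
    noise-only frame (an assumed interpretation).
    """
    m = len(noise_history[0])
    count = len(noise_history)
    mean = [sum(frame[ch] for frame in noise_history) / count for ch in range(m)]
    factor = attenuation ** speech_frames_elapsed
    return [level * factor for level in mean]
```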

As one general example of a noise removal method, noise removal on the time axis is performed, as shown in FIG. 5, by subtracting the predicted noise waveform (b) from the noise-contaminated speech signal (a), so that only the speech signal (c) is extracted. In this embodiment, as shown in FIG. 6, noise is removed on the frequency axis: the noise-contaminated speech signal (a) is Fourier-transformed, the spectrum (c) of the predicted noise is subtracted from the resulting spectrum (b), and the subtracted spectrum (d) is inverse-Fourier-transformed to obtain a noise-free speech signal (e). The band synthesizing means 415 inverse-Fourier-transforms the m channel signals supplied from the noise removing means 414 to obtain a speech output. The voice cutting-out means 406 performs cutting out from the signal synthesized by the band synthesizing means 415.
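The frequency-axis procedure of FIG. 6 (transform, subtract the predicted noise spectrum, inverse transform) can be sketched as follows. The patent does not say how the subtraction treats phase or negative results, so this sketch assumes a common choice: subtract magnitudes per bin, clamp at zero, and keep the phase of the noisy signal. The names `dft` and `spectral_subtract` are hypothetical, and a naive DFT again stands in for the FFT.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive DFT or inverse DFT, O(n^2); adequate for short demo frames."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [
        sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [c / n for c in out] if inverse else out


def spectral_subtract(noisy_frame, noise_magnitude):
    """Subtract a predicted noise magnitude spectrum from a noisy frame.

    noise_magnitude holds one non-negative value per frequency bin.
    Magnitudes are clamped at zero and the noisy phase is kept (assumed).
    """
    cleaned = []
    for c, noise in zip(dft(noisy_frame), noise_magnitude):
        mag = max(abs(c) - noise, 0.0)
        cleaned.append(cmath.rect(mag, cmath.phase(c)))
    # The input frame is real, so only the real part of the result is kept.
    return [c.real for c in dft(cleaned, inverse=True)]
```

Clamping at zero prevents an overestimated noise spectrum from producing negative magnitudes, at the cost of some residual distortion in bins where speech and noise overlap.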

In this embodiment, noise is thus removed from the speech signal to be recognized, so speech recognition with a higher recognition rate is achieved.

[Other embodiments]

In the above embodiments, the various means, such as the cepstrum analysis means, the peak frequency detection means, and the comparison means, can be implemented in software on a computer, but they can also be implemented with dedicated hardware circuits.

In the present invention, noise means any signal other than the signal of interest, so speech may itself be treated as noise in some cases.

As is clear from the above embodiments, the speech recognition device according to the present invention uses cepstrum analysis to identify, from the speech signal, the person who uttered the speech, and also uses that speaker information, so it is effective for systems that require secrecy.

Furthermore, when noise is removed before recognition, recognition is performed on the noise-removed signal, so the recognition rate is further improved.

Claims

1. A speech recognition device comprising: first storage means for storing in advance a peak frequency obtained by band-dividing the speech signal of a known person and performing cepstrum analysis; band dividing means for band-dividing an unknown, noise-contaminated speech input signal; cepstrum analysis means for performing cepstrum analysis on the output of the band dividing means; peak frequency detection means for detecting a peak frequency from the cepstrum analysis output of the cepstrum analysis means; first comparison means for comparing the peak detection information from the peak frequency detection means with each peak frequency stored in the first storage means and determining whose speech signal the unknown speech signal is; second storage means for storing in advance the speech features of various utterances; voice cutting-out means for cutting out the speech portion from the noise-contaminated speech signal; feature extraction means for extracting features of the cut-out speech; second comparison means for recognizing the speech by comparing the speech features extracted by the feature extraction means with the speech features stored in the second storage means; and comparison output synthesis means for performing predetermined processing based on the comparison result of the first comparison means and the comparison result of the second comparison means and outputting the result.

2. The speech recognition device according to claim 1, further comprising: speech discriminating means for discriminating the speech portion using the cepstrum analysis result from the cepstrum analysis means; noise prediction means for predicting the noise in the speech portion of the band-divided signal using the speech portion information from the speech discriminating means; noise removing means for removing the predicted noise from the band-divided signal; and band synthesizing means for band-synthesizing the output of the noise removing means, wherein the voice cutting-out means performs cutting out from the signal synthesized by the band synthesizing means.